Data Source
Food Prices for Nutrition DataHub: global statistics on the Cost and Affordability of Healthy Diets - WorldBank.org : https://www.worldbank.org/en/programs/icp/brief/foodpricesfornutrition?cq_ck=1646845586643#D_top
This analysis explores nutritional economics: how food prices shape access to a well-balanced diet.
Our target variable is the Percent of the population who cannot afford a healthy diet, designated [CoHD_headcount]: the share of people whose financial constraints prevent them from purchasing a nutritionally adequate diet. Across the dataset, this variable averages 37%.
Our goal is to build a predictive model that assigns each observation to one of two groups: those at or below the 37% threshold and those above it.
This predictive framework holds substantial potential for informing strategies to enhance accessibility to healthy diets and, subsequently, contribute to the advancement of public health initiatives.
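The binarization at the 37% mean can be sketched as follows; the values below are hypothetical and stand in for the real [CoHD_headcount] column:

```python
import pandas as pd

# Hypothetical headcount percentages, standing in for CoHD_headcount
headcount = pd.Series([12.5, 37.0, 42.9, 88.7], name="CoHD_headcount")
threshold = 37  # sample mean of the target variable

# 0 = at or below the threshold, 1 = above it
target = (headcount > threshold).astype(int)
print(target.tolist())  # → [0, 0, 1, 1]
```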
# Libraries
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from matplotlib.legend_handler import HandlerLine2D

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Preprocessing, resampling and model selection
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.model_selection import (train_test_split, GridSearchCV, cross_validate,
                                     cross_val_score, KFold, StratifiedKFold,
                                     RepeatedStratifiedKFold)
from imblearn.over_sampling import SMOTE

# Metrics (shared by all models)
from sklearn import metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix, ConfusionMatrixDisplay,
                             roc_curve, auc, roc_auc_score, mean_squared_error)

## Model 1 - Logistic Regression
from sklearn.linear_model import LogisticRegression
# Model 2 - Decision Trees
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
## Model 4 - Random Forest
from sklearn.ensemble import RandomForestClassifier
## Model 5 - XGBoost
from sklearn.svm import SVC
import xgboost as xgb
from xgboost import XGBClassifier, plot_importance
## Model 6 - Neural Networks
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from kerastuner.tuners import GridSearch
from kerastuner import HyperParameters
def coding(col, codeDict):
    """Recode the values of a column according to a mapping dictionary."""
    colCoded = pd.Series(col, copy=True)
    for key, value in codeDict.items():
        colCoded.replace(key, value, inplace=True)
    return colCoded
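For example, the helper can recode category labels into integers; the helper is restated below so the snippet runs standalone, and the sample labels are hypothetical:

```python
import pandas as pd

def coding(col, codeDict):
    """Recode the values of a column according to a mapping dictionary."""
    colCoded = pd.Series(col, copy=True)
    for key, value in codeDict.items():
        colCoded.replace(key, value, inplace=True)
    return colCoded

# Toy usage: map classification codes to integers
codes = coding(['FPN 1.0', 'FPN 2.1', 'FPN 1.0'], {'FPN 1.0': 0, 'FPN 2.1': 1})
print(codes.tolist())  # → [0, 1, 0]
```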
### Model 6 - Neural Networks
# Reproducibility, scikit-learn wrapper and plot styling
from numpy.random import seed
import tensorflow
import sklearn
import os
from matplotlib.pyplot import rcParams
rcParams['figure.figsize'] = 10, 8
sns.set(style='whitegrid', palette='muted', rc={'figure.figsize': (15, 10)})
# scikeras replaces the deprecated keras.wrappers.scikit_learn wrapper
from scikeras.wrappers import KerasClassifier
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
df = pd.read_csv('dataset2017.csv')
df.head()
| Classification Name | Classification Code | Country Name | Country Code | Time | Time Code | Population [Pop] | Millions of people who cannot afford a healthy diet [CoHD_unafford_n] | Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n] | Millions of people who cannot afford sufficient calories [CoCA_unafford_n] | ... | Cost of oils and fats [CoHD_of] | Cost of legumes, nuts and seeds [CoHD_lns] | Cost of animal-source foods [CoHD_asf] | Cost of starchy staples [CoHD_ss] | Cost of vegetables [CoHD_v] | Cost of fruits [CoHD_f] | Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA] | Cost of a healthy diet [CoHD] | Cost of a nutrient adequate diet [CoNA] | Cost of an energy sufficient diet [CoCA] | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Food Prices for Nutrition 1.0 | FPN 1.0 | WORLD | WLD | 2017 | YR2017 | 7208180851 | 3049.1 | 2212.2 | 381.2 | ... | 0.13 | 0.348 | 0.873 | 0.508 | 0.793 | 0.662 | .. | 3.314 | 2.457 | 0.834 |
| 1 | Food Prices for Nutrition 1.0 | FPN 1.0 | Albania | ALB | 2017 | YR2017 | 2873457 | 1.1 | 0.4 | 0 | ... | 0.089 | 0.441 | 1.204 | 0.599 | 0.707 | 0.911 | 5.45 | 3.952 | 2.471 | 0.725 |
| 2 | Food Prices for Nutrition 1.0 | FPN 1.0 | Algeria | DZA | 2017 | YR2017 | 41389174 | 14.6 | 3 | 0.1 | ... | 0.132 | 0.493 | 0.964 | 0.496 | 0.699 | 0.979 | 4.87 | 3.763 | 2.323 | 0.773 |
| 3 | Food Prices for Nutrition 1.0 | FPN 1.0 | Angola | AGO | 2017 | YR2017 | 29816769 | 27.7 | 26 | 17 | ... | 0.166 | 0.395 | 1.011 | 0.839 | 1.199 | 0.717 | 3.08 | 4.327 | 3.231 | 1.403 |
| 4 | Food Prices for Nutrition 1.0 | FPN 1.0 | Anguilla | AIA | 2017 | YR2017 | .. | .. | .. | .. | ... | 0.134 | 0.328 | 0.647 | 0.778 | 1.019 | 0.811 | 2.84 | 3.717 | 2.433 | 1.308 |
5 rows × 40 columns
# Number of observations and variables
np.shape(df)
(744, 40)
df.info()
print(df.dtypes)
Classification Name                                                                                           object
Classification Code                                                                                           object
Country Name                                                                                                  object
Country Code                                                                                                  object
Time                                                                                                           int64
Time Code                                                                                                     object
Population [Pop]                                                                                              object
Millions of people who cannot afford a healthy diet [CoHD_unafford_n]                                         object
Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n]                                      object
Millions of people who cannot afford sufficient calories [CoCA_unafford_n]                                    object
Percent of the population who cannot afford a healthy diet [CoHD_headcount]                                   object
Percent of the population who cannot afford nutrient adequacy [CoNA_headcount]                                object
Percent of the population who cannot afford sufficient calories [CoCA_headcount]                              object
Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp]                               object
Affordability of a nutrient adequate diet: ratio of cost to food expenditures [CoNA_fexp]                     object
Affordability of an energy sufficient diet: ratio of cost to food expenditures [CoCA_fexp]                    object
Affordability of a healthy diet: ratio of cost to the food poverty line [CoHD_pov]                            object
Affordability of a nutrient adequate diet: ratio of cost to the food poverty line [CoNA_pov]                  object
Affordability of an energy sufficient diet: ratio of cost to the food poverty line [CoCA_pov]                 object
Cost of oils and fats relative to the starchy staples in a least-cost healthy diet [CoHD_of_ss]               object
Cost of legumes, nuts and seeds relative to the starchy staples in a least-cost healthy diet [CoHD_lns_ss]    object
Cost of animal-sourced foods relative to the starchy staples in a least-cost healthy diet [CoHD_asf_ss]       object
Cost of vegetables relative to the starchy staples in a least-cost healthy diet [CoHD_v_ss]                   object
Cost share for oils and fats in a least-cost healthy diet [CoHD_of_prop]                                      object
Cost of fruits relative to the starchy staples in a least-cost healthy diet [CoHD_f_ss]                       object
Cost share for legumes, nuts and seeds in a least-cost healthy diet [CoHD_lns_prop]                           object
Cost share for animal-sourced foods in a least-cost healthy diet [CoHD_asf_prop]                              object
Cost share for starchy staples in a least-cost healthy diet [CoHD_ss_prop]                                    object
Cost share for vegetables in a least-cost healthy diet [CoHD_v_prop]                                          object
Cost share for fruits in a least-cost healthy diet [CoHD_f_prop]                                              object
Cost of oils and fats [CoHD_of]                                                                               object
Cost of legumes, nuts and seeds [CoHD_lns]                                                                    object
Cost of animal-source foods [CoHD_asf]                                                                        object
Cost of starchy staples [CoHD_ss]                                                                             object
Cost of vegetables [CoHD_v]                                                                                   object
Cost of fruits [CoHD_f]                                                                                       object
Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA]             object
Cost of a healthy diet [CoHD]                                                                                 object
Cost of a nutrient adequate diet [CoNA]                                                                      float64
Cost of an energy sufficient diet [CoCA]                                                                     float64
dtype: object
df.describe()
| Time | Cost of a nutrient adequate diet [CoNA] | Cost of an energy sufficient diet [CoCA] | |
|---|---|---|---|
| count | 744.0 | 744.000000 | 744.000000 |
| mean | 2017.0 | 2.453239 | 0.833306 |
| std | 0.0 | 0.647819 | 0.355677 |
| min | 2017.0 | 0.709000 | 0.239000 |
| 25% | 2017.0 | 2.028250 | 0.608000 |
| 50% | 2017.0 | 2.353000 | 0.801000 |
| 75% | 2017.0 | 2.686000 | 1.009000 |
| max | 2017.0 | 5.866000 | 2.958000 |
modalities = np.unique(df['Classification Code'])
print("Modalities:", modalities)
Modalities: ['FPN 1.0' 'FPN 1.1' 'FPN 2.0' 'FPN 2.1']
Our data set contains 744 observations described by 40 features. Of these 40 variables, one is an integer (Time), two are floats (CoNA and CoCA), and the rest are object types.
After this first exploration, four variables do not appear to carry any useful information, so we remove them from the data set: Time, Time Code, Classification Name and Country Code.
print(f"Number of columns before deletion : {df.shape[1]}")
del_cols = ['Classification Name', 'Time', 'Country Code', 'Time Code']
df.drop(labels=del_cols, axis=1, inplace=True)
print(f"Number of columns after deletion : {df.shape[1]}")
Number of columns before deletion : 40 Number of columns after deletion : 36
df.head()
| Classification Code | Country Name | Population [Pop] | Millions of people who cannot afford a healthy diet [CoHD_unafford_n] | Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n] | Millions of people who cannot afford sufficient calories [CoCA_unafford_n] | Percent of the population who cannot afford a healthy diet [CoHD_headcount] | Percent of the population who cannot afford nutrient adequacy [CoNA_headcount] | Percent of the population who cannot afford sufficient calories [CoCA_headcount] | Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp] | ... | Cost of oils and fats [CoHD_of] | Cost of legumes, nuts and seeds [CoHD_lns] | Cost of animal-source foods [CoHD_asf] | Cost of starchy staples [CoHD_ss] | Cost of vegetables [CoHD_v] | Cost of fruits [CoHD_f] | Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA] | Cost of a healthy diet [CoHD] | Cost of a nutrient adequate diet [CoNA] | Cost of an energy sufficient diet [CoCA] | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FPN 1.0 | WORLD | 7208180851 | 3049.1 | 2212.2 | 381.2 | 42.9 | 31.1 | 5.4 | .. | ... | 0.13 | 0.348 | 0.873 | 0.508 | 0.793 | 0.662 | .. | 3.314 | 2.457 | 0.834 |
| 1 | FPN 1.0 | Albania | 2873457 | 1.1 | 0.4 | 0 | 37.8 | 13 | 0 | 0.425 | ... | 0.089 | 0.441 | 1.204 | 0.599 | 0.707 | 0.911 | 5.45 | 3.952 | 2.471 | 0.725 |
| 2 | FPN 1.0 | Algeria | 41389174 | 14.6 | 3 | 0.1 | 35.2 | 7.2 | 0.2 | 0.605 | ... | 0.132 | 0.493 | 0.964 | 0.496 | 0.699 | 0.979 | 4.87 | 3.763 | 2.323 | 0.773 |
| 3 | FPN 1.0 | Angola | 29816769 | 27.7 | 26 | 17 | 92.9 | 87.1 | 57.2 | 0.972 | ... | 0.166 | 0.395 | 1.011 | 0.839 | 1.199 | 0.717 | 3.08 | 4.327 | 3.231 | 1.403 |
| 4 | FPN 1.0 | Anguilla | .. | .. | .. | .. | .. | .. | .. | 0.577 | ... | 0.134 | 0.328 | 0.647 | 0.778 | 1.019 | 0.811 | 2.84 | 3.717 | 2.433 | 1.308 |
5 rows × 36 columns
To allow for a better analysis of the variables, we transform the object-type variables into floats, as they all express numeric quantities (counts, percentages, costs, or ratios). The only two variables left with their initial object type are Country Name and Classification Code.
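The World Bank extract uses `..` as a missing-value marker, which is what forces numeric columns to dtype object; `pd.to_numeric(..., errors='coerce')` turns such markers into NaN. A minimal illustration, with hypothetical raw values:

```python
import pandas as pd

# Hypothetical raw column as read from the CSV: numbers stored as strings,
# with '..' marking missing observations
raw = pd.Series(['42.9', '37.8', '..', '92.9'])

# errors='coerce' converts unparseable entries ('..') to NaN
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric.tolist())  # → [42.9, 37.8, nan, 92.9]
```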
print(df.columns)
Index(['Classification Code', 'Country Name', 'Population [Pop]',
'Millions of people who cannot afford a healthy diet [CoHD_unafford_n]',
'Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n]',
'Millions of people who cannot afford sufficient calories [CoCA_unafford_n]',
'Percent of the population who cannot afford a healthy diet [CoHD_headcount]',
'Percent of the population who cannot afford nutrient adequacy [CoNA_headcount]',
'Percent of the population who cannot afford sufficient calories [CoCA_headcount]',
'Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp]',
'Affordability of a nutrient adequate diet: ratio of cost to food expenditures [CoNA_fexp]',
'Affordability of an energy sufficient diet: ratio of cost to food expenditures [CoCA_fexp]',
'Affordability of a healthy diet: ratio of cost to the food poverty line [CoHD_pov]',
'Affordability of a nutrient adequate diet: ratio of cost to the food poverty line [CoNA_pov]',
'Affordability of an energy sufficient diet: ratio of cost to the food poverty line [CoCA_pov]',
'Cost of oils and fats relative to the starchy staples in a least-cost healthy diet [CoHD_of_ss]',
'Cost of legumes, nuts and seeds relative to the starchy staples in a least-cost healthy diet [CoHD_lns_ss]',
'Cost of animal-sourced foods relative to the starchy staples in a least-cost healthy diet [CoHD_asf_ss]',
'Cost of vegetables relative to the starchy staples in a least-cost healthy diet [CoHD_v_ss]',
'Cost share for oils and fats in a least-cost healthy diet [CoHD_of_prop]',
'Cost of fruits relative to the starchy staples in a least-cost healthy diet [CoHD_f_ss]',
'Cost share for legumes, nuts and seeds in a least-cost healthy diet [CoHD_lns_prop]',
'Cost share for animal-sourced foods in a least-cost healthy diet [CoHD_asf_prop]',
'Cost share for starchy staples in a least-cost healthy diet [CoHD_ss_prop]',
'Cost share for vegetables in a least-cost healthy diet [CoHD_v_prop]',
'Cost share for fruits in a least-cost healthy diet [CoHD_f_prop]',
'Cost of oils and fats [CoHD_of]',
'Cost of legumes, nuts and seeds [CoHD_lns]',
'Cost of animal-source foods [CoHD_asf]',
'Cost of starchy staples [CoHD_ss]', 'Cost of vegetables [CoHD_v]',
'Cost of fruits [CoHD_f]',
'Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA]',
'Cost of a healthy diet [CoHD]',
'Cost of a nutrient adequate diet [CoNA]',
'Cost of an energy sufficient diet [CoCA]'],
dtype='object')
columns_to_transform = [
'Population [Pop]',
'Millions of people who cannot afford a healthy diet [CoHD_unafford_n]',
'Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n]',
'Millions of people who cannot afford sufficient calories [CoCA_unafford_n]',
'Percent of the population who cannot afford a healthy diet [CoHD_headcount]',
'Percent of the population who cannot afford nutrient adequacy [CoNA_headcount]',
'Percent of the population who cannot afford sufficient calories [CoCA_headcount]',
'Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp]',
'Affordability of a nutrient adequate diet: ratio of cost to food expenditures [CoNA_fexp]',
'Affordability of an energy sufficient diet: ratio of cost to food expenditures [CoCA_fexp]',
'Affordability of a healthy diet: ratio of cost to the food poverty line [CoHD_pov]',
'Affordability of a nutrient adequate diet: ratio of cost to the food poverty line [CoNA_pov]',
'Affordability of an energy sufficient diet: ratio of cost to the food poverty line [CoCA_pov]',
'Cost of oils and fats relative to the starchy staples in a least-cost healthy diet [CoHD_of_ss]',
'Cost of legumes, nuts and seeds relative to the starchy staples in a least-cost healthy diet [CoHD_lns_ss]',
'Cost of animal-sourced foods relative to the starchy staples in a least-cost healthy diet [CoHD_asf_ss]',
'Cost of vegetables relative to the starchy staples in a least-cost healthy diet [CoHD_v_ss]',
'Cost share for oils and fats in a least-cost healthy diet [CoHD_of_prop]',
'Cost of fruits relative to the starchy staples in a least-cost healthy diet [CoHD_f_ss]',
'Cost share for legumes, nuts and seeds in a least-cost healthy diet [CoHD_lns_prop]',
'Cost share for animal-sourced foods in a least-cost healthy diet [CoHD_asf_prop]',
'Cost share for starchy staples in a least-cost healthy diet [CoHD_ss_prop]',
'Cost share for vegetables in a least-cost healthy diet [CoHD_v_prop]',
'Cost share for fruits in a least-cost healthy diet [CoHD_f_prop]',
'Cost of oils and fats [CoHD_of]',
'Cost of legumes, nuts and seeds [CoHD_lns]',
'Cost of animal-source foods [CoHD_asf]',
'Cost of starchy staples [CoHD_ss]', 'Cost of vegetables [CoHD_v]',
'Cost of fruits [CoHD_f]',
'Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA]',
'Cost of a healthy diet [CoHD]',
'Cost of a nutrient adequate diet [CoNA]',
'Cost of an energy sufficient diet [CoCA]'
]
# '..' placeholders become NaN via errors='coerce'; the result is already float
df[columns_to_transform] = df[columns_to_transform].apply(pd.to_numeric, errors='coerce')
print(df)
Classification Code Country Name Population [Pop] \
0 FPN 1.0 WORLD 7.208181e+09
1 FPN 1.0 Albania 2.873457e+06
2 FPN 1.0 Algeria 4.138917e+07
3 FPN 1.0 Angola 2.981677e+07
4 FPN 1.0 Anguilla NaN
.. ... ... ...
739 FPN 2.1 Uruguay 3.422200e+06
740 FPN 2.1 Vietnam 9.403305e+07
741 FPN 2.1 West Bank and Gaza 4.454805e+06
742 FPN 2.1 Zambia 1.729805e+07
743 FPN 2.1 Zimbabwe 1.475110e+07
Millions of people who cannot afford a healthy diet [CoHD_unafford_n] \
0 3049.1
1 1.1
2 14.6
3 27.7
4 NaN
.. ...
739 0.1
740 23.4
741 0.8
742 15.3
743 10.0
Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n] \
0 2212.2
1 0.4
2 3.0
3 26.0
4 NaN
.. ...
739 0.0
740 11.0
741 0.1
742 14.3
743 1.6
Millions of people who cannot afford sufficient calories [CoCA_unafford_n] \
0 381.2
1 0.0
2 0.1
3 17.0
4 NaN
.. ...
739 0.0
740 0.7
741 0.0
742 11.2
743 0.0
Percent of the population who cannot afford a healthy diet [CoHD_headcount] \
0 42.9
1 37.8
2 35.2
3 92.9
4 NaN
.. ...
739 2.8
740 24.9
741 18.0
742 88.7
743 67.8
Percent of the population who cannot afford nutrient adequacy [CoNA_headcount] \
0 31.1
1 13.0
2 7.2
3 87.1
4 NaN
.. ...
739 0.8
740 11.7
741 2.4
742 83.0
743 10.6
Percent of the population who cannot afford sufficient calories [CoCA_headcount] \
0 5.4
1 0.0
2 0.2
3 57.2
4 NaN
.. ...
739 0.0
740 0.7
741 0.6
742 64.9
743 0.0
Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp] \
0 NaN
1 0.425
2 0.605
3 0.972
4 0.577
.. ...
739 0.416
740 1.052
741 0.845
742 1.821
743 0.973
... Cost of oils and fats [CoHD_of] \
0 ... 0.130
1 ... 0.089
2 ... 0.132
3 ... 0.166
4 ... 0.134
.. ... ...
739 ... 0.068
740 ... 0.189
741 ... 0.122
742 ... 0.130
743 ... 0.056
Cost of legumes, nuts and seeds [CoHD_lns] \
0 0.34800
1 0.44100
2 0.49300
3 0.39500
4 0.32800
.. ...
739 0.48700
740 0.30500
741 0.18669
742 0.23200
743 0.12800
Cost of animal-source foods [CoHD_asf] \
0 0.873
1 1.204
2 0.964
3 1.011
4 0.647
.. ...
739 0.615
740 1.183
741 1.127
742 0.906
743 0.368
Cost of starchy staples [CoHD_ss] Cost of vegetables [CoHD_v] \
0 0.508 0.793
1 0.599 0.707
2 0.496 0.699
3 0.839 1.199
4 0.778 1.019
.. ... ...
739 0.378 0.989
740 0.582 0.771
741 0.649 0.564
742 0.814 0.631
743 0.213 0.375
Cost of fruits [CoHD_f] \
0 0.662
1 0.911
2 0.979
3 0.717
4 0.811
.. ...
739 0.537
740 0.556
741 0.695
742 0.372
743 0.199
Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA] \
0 NaN
1 5.45
2 4.87
3 3.08
4 2.84
.. ...
739 4.43
740 3.68
741 2.97
742 2.41
743 9.20
Cost of a healthy diet [CoHD] Cost of a nutrient adequate diet [CoNA] \
0 3.314000 2.457
1 3.952000 2.471
2 3.763000 2.323
3 4.327000 3.231
4 3.717000 2.433
.. ... ...
739 3.073000 2.129
740 3.586000 2.514
741 3.342000 1.761
742 3.085000 2.380
743 2.199685 0.709
Cost of an energy sufficient diet [CoCA]
0 0.834
1 0.725
2 0.773
3 1.403
4 1.308
.. ...
739 0.694
740 0.975
741 1.124
742 1.279
743 0.239
[744 rows x 36 columns]
df.describe()
| Population [Pop] | Millions of people who cannot afford a healthy diet [CoHD_unafford_n] | Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n] | Millions of people who cannot afford sufficient calories [CoCA_unafford_n] | Percent of the population who cannot afford a healthy diet [CoHD_headcount] | Percent of the population who cannot afford nutrient adequacy [CoNA_headcount] | Percent of the population who cannot afford sufficient calories [CoCA_headcount] | Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp] | Affordability of a nutrient adequate diet: ratio of cost to food expenditures [CoNA_fexp] | Affordability of an energy sufficient diet: ratio of cost to food expenditures [CoCA_fexp] | ... | Cost of oils and fats [CoHD_of] | Cost of legumes, nuts and seeds [CoHD_lns] | Cost of animal-source foods [CoHD_asf] | Cost of starchy staples [CoHD_ss] | Cost of vegetables [CoHD_v] | Cost of fruits [CoHD_f] | Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA] | Cost of a healthy diet [CoHD] | Cost of a nutrient adequate diet [CoNA] | Cost of an energy sufficient diet [CoCA] | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7.320000e+02 | 620.000000 | 620.000000 | 620.000000 | 620.000000 | 620.000000 | 620.000000 | 700.000000 | 720.000000 | 720.000000 | ... | 724.000000 | 724.000000 | 724.000000 | 724.000000 | 724.000000 | 724.000000 | 704.000000 | 724.000000 | 744.000000 | 744.000000 |
| mean | 1.584458e+08 | 80.211613 | 58.572258 | 9.243226 | 37.424355 | 27.531774 | 6.933065 | 0.808629 | 0.591792 | 0.212583 | ... | 0.130354 | 0.346642 | 0.872105 | 0.507710 | 0.791588 | 0.660972 | 4.530318 | 3.311839 | 2.453239 | 0.833306 |
| std | 6.630139e+08 | 344.723374 | 254.009743 | 41.184856 | 34.572739 | 29.159310 | 13.462474 | 0.656850 | 0.446859 | 0.196555 | ... | 0.052294 | 0.146032 | 0.210474 | 0.185818 | 0.269550 | 0.242642 | 1.939288 | 0.638392 | 0.647819 | 0.355677 |
| min | 2.956700e+04 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.247000 | 0.167000 | 0.032000 | ... | 0.052000 | 0.069000 | 0.368000 | 0.147000 | 0.303000 | 0.162000 | 1.124543 | 1.822000 | 0.709000 | 0.239000 |
| 25% | 2.381182e+06 | 0.200000 | 0.100000 | 0.000000 | 3.000000 | 1.400000 | 0.100000 | 0.394000 | 0.311000 | 0.087000 | ... | 0.093000 | 0.256000 | 0.704000 | 0.401000 | 0.605000 | 0.504000 | 3.277500 | 2.863000 | 2.028250 | 0.608000 |
| 50% | 9.854033e+06 | 2.100000 | 1.450000 | 0.100000 | 24.800000 | 15.500000 | 0.900000 | 0.640000 | 0.452000 | 0.162000 | ... | 0.129000 | 0.331000 | 0.859000 | 0.496000 | 0.754000 | 0.634000 | 4.060000 | 3.252000 | 2.353000 | 0.801000 |
| 75% | 4.012708e+07 | 16.425000 | 11.575000 | 2.000000 | 75.000000 | 54.600000 | 6.975000 | 0.950000 | 0.731000 | 0.258000 | ... | 0.160000 | 0.409000 | 1.011000 | 0.587000 | 0.917000 | 0.790000 | 5.460000 | 3.622000 | 2.686000 | 1.009000 |
| max | 7.262507e+09 | 3133.400000 | 2293.800000 | 381.200000 | 97.500000 | 96.000000 | 78.500000 | 5.272000 | 3.913000 | 1.440000 | ... | 0.453000 | 0.875000 | 1.583000 | 1.477000 | 1.785000 | 1.997000 | 10.880000 | 5.975000 | 5.866000 | 2.958000 |
8 rows × 34 columns
The next step in our data exploration is the identification of missing values.
Missing values, or missing data, occur when information is not available or not recorded for specific variables or observations.
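On a DataFrame, `isnull().sum()` returns the per-column count of missing entries. A toy illustration with hypothetical column names:

```python
import pandas as pd
import numpy as np

# Hypothetical frame with one NaN in each column
toy = pd.DataFrame({'pop': [1.0, np.nan, 3.0],
                    'cost': [0.5, 0.7, np.nan]})

# Count missing values per column
print(toy.isnull().sum().to_dict())  # → {'pop': 1, 'cost': 1}
```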
df.isnull().sum()
Classification Code                                                                                             0
Country Name                                                                                                    0
Population [Pop]                                                                                               12
Millions of people who cannot afford a healthy diet [CoHD_unafford_n]                                         124
Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n]                                      124
Millions of people who cannot afford sufficient calories [CoCA_unafford_n]                                    124
Percent of the population who cannot afford a healthy diet [CoHD_headcount]                                   124
Percent of the population who cannot afford nutrient adequacy [CoNA_headcount]                                124
Percent of the population who cannot afford sufficient calories [CoCA_headcount]                              124
Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp]                                44
Affordability of a nutrient adequate diet: ratio of cost to food expenditures [CoNA_fexp]                      24
Affordability of an energy sufficient diet: ratio of cost to food expenditures [CoCA_fexp]                     24
Affordability of a healthy diet: ratio of cost to the food poverty line [CoHD_pov]                             44
Affordability of a nutrient adequate diet: ratio of cost to the food poverty line [CoNA_pov]                   24
Affordability of an energy sufficient diet: ratio of cost to the food poverty line [CoCA_pov]                  24
Cost of oils and fats relative to the starchy staples in a least-cost healthy diet [CoHD_of_ss]                44
Cost of legumes, nuts and seeds relative to the starchy staples in a least-cost healthy diet [CoHD_lns_ss]     44
Cost of animal-sourced foods relative to the starchy staples in a least-cost healthy diet [CoHD_asf_ss]        44
Cost of vegetables relative to the starchy staples in a least-cost healthy diet [CoHD_v_ss]                    44
Cost share for oils and fats in a least-cost healthy diet [CoHD_of_prop]                                       44
Cost of fruits relative to the starchy staples in a least-cost healthy diet [CoHD_f_ss]                        44
Cost share for legumes, nuts and seeds in a least-cost healthy diet [CoHD_lns_prop]                            44
Cost share for animal-sourced foods in a least-cost healthy diet [CoHD_asf_prop]                               44
Cost share for starchy staples in a least-cost healthy diet [CoHD_ss_prop]                                     44
Cost share for vegetables in a least-cost healthy diet [CoHD_v_prop]                                           44
Cost share for fruits in a least-cost healthy diet [CoHD_f_prop]                                               44
Cost of oils and fats [CoHD_of]                                                                                20
Cost of legumes, nuts and seeds [CoHD_lns]                                                                     20
Cost of animal-source foods [CoHD_asf]                                                                         20
Cost of starchy staples [CoHD_ss]                                                                              20
Cost of vegetables [CoHD_v]                                                                                    20
Cost of fruits [CoHD_f]                                                                                        20
Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA]              40
Cost of a healthy diet [CoHD]                                                                                  20
Cost of a nutrient adequate diet [CoNA]                                                                         0
Cost of an energy sufficient diet [CoCA]                                                                        0
dtype: int64
# Replace missing values with the mean of each float variable
float_columns = df.select_dtypes(include='float').columns
df[float_columns] = df[float_columns].fillna(df[float_columns].mean())
print(df)
Classification Code Country Name Population [Pop] \
0 FPN 1.0 WORLD 7.208181e+09
1 FPN 1.0 Albania 2.873457e+06
2 FPN 1.0 Algeria 4.138917e+07
3 FPN 1.0 Angola 2.981677e+07
4 FPN 1.0 Anguilla 1.584458e+08
.. ... ... ...
739 FPN 2.1 Uruguay 3.422200e+06
740 FPN 2.1 Vietnam 9.403305e+07
741 FPN 2.1 West Bank and Gaza 4.454805e+06
742 FPN 2.1 Zambia 1.729805e+07
743 FPN 2.1 Zimbabwe 1.475110e+07
Millions of people who cannot afford a healthy diet [CoHD_unafford_n] \
0 3049.100000
1 1.100000
2 14.600000
3 27.700000
4 80.211613
.. ...
739 0.100000
740 23.400000
741 0.800000
742 15.300000
743 10.000000
Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n] \
0 2212.200000
1 0.400000
2 3.000000
3 26.000000
4 58.572258
.. ...
739 0.000000
740 11.000000
741 0.100000
742 14.300000
743 1.600000
Millions of people who cannot afford sufficient calories [CoCA_unafford_n] \
0 381.200000
1 0.000000
2 0.100000
3 17.000000
4 9.243226
.. ...
739 0.000000
740 0.700000
741 0.000000
742 11.200000
743 0.000000
Percent of the population who cannot afford a healthy diet [CoHD_headcount] \
0 42.900000
1 37.800000
2 35.200000
3 92.900000
4 37.424355
.. ...
739 2.800000
740 24.900000
741 18.000000
742 88.700000
743 67.800000
Percent of the population who cannot afford nutrient adequacy [CoNA_headcount] \
0 31.100000
1 13.000000
2 7.200000
3 87.100000
4 27.531774
.. ...
739 0.800000
740 11.700000
741 2.400000
742 83.000000
743 10.600000
Percent of the population who cannot afford sufficient calories [CoCA_headcount] \
0 5.400000
1 0.000000
2 0.200000
3 57.200000
4 6.933065
.. ...
739 0.000000
740 0.700000
741 0.600000
742 64.900000
743 0.000000
Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp] \
0 0.808629
1 0.425000
2 0.605000
3 0.972000
4 0.577000
.. ...
739 0.416000
740 1.052000
741 0.845000
742 1.821000
743 0.973000
... Cost of oils and fats [CoHD_of] \
0 ... 0.130
1 ... 0.089
2 ... 0.132
3 ... 0.166
4 ... 0.134
.. ... ...
739 ... 0.068
740 ... 0.189
741 ... 0.122
742 ... 0.130
743 ... 0.056
Cost of legumes, nuts and seeds [CoHD_lns] \
0 0.34800
1 0.44100
2 0.49300
3 0.39500
4 0.32800
.. ...
739 0.48700
740 0.30500
741 0.18669
742 0.23200
743 0.12800
Cost of animal-source foods [CoHD_asf] \
0 0.873
1 1.204
2 0.964
3 1.011
4 0.647
.. ...
739 0.615
740 1.183
741 1.127
742 0.906
743 0.368
Cost of starchy staples [CoHD_ss] Cost of vegetables [CoHD_v] \
0 0.508 0.793
1 0.599 0.707
2 0.496 0.699
3 0.839 1.199
4 0.778 1.019
.. ... ...
739 0.378 0.989
740 0.582 0.771
741 0.649 0.564
742 0.814 0.631
743 0.213 0.375
Cost of fruits [CoHD_f] \
0 0.662
1 0.911
2 0.979
3 0.717
4 0.811
.. ...
739 0.537
740 0.556
741 0.695
742 0.372
743 0.199
Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA] \
0 4.530318
1 5.450000
2 4.870000
3 3.080000
4 2.840000
.. ...
739 4.430000
740 3.680000
741 2.970000
742 2.410000
743 9.200000
Cost of a healthy diet [CoHD] Cost of a nutrient adequate diet [CoNA] \
0 3.314000 2.457
1 3.952000 2.471
2 3.763000 2.323
3 4.327000 3.231
4 3.717000 2.433
.. ... ...
739 3.073000 2.129
740 3.586000 2.514
741 3.342000 1.761
742 3.085000 2.380
743 2.199685 0.709
Cost of an energy sufficient diet [CoCA]
0 0.834
1 0.725
2 0.773
3 1.403
4 1.308
.. ...
739 0.694
740 0.975
741 1.124
742 1.279
743 0.239
[744 rows x 36 columns]
df.isnull().sum()
Classification Code 0 Country Name 0 Population [Pop] 0 Millions of people who cannot afford a healthy diet [CoHD_unafford_n] 0 Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n] 0 Millions of people who cannot afford sufficient calories [CoCA_unafford_n] 0 Percent of the population who cannot afford a healthy diet [CoHD_headcount] 0 Percent of the population who cannot afford nutrient adequacy [CoNA_headcount] 0 Percent of the population who cannot afford sufficient calories [CoCA_headcount] 0 Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp] 0 Affordability of a nutrient adequate diet: ratio of cost to food expenditures [CoNA_fexp] 0 Affordability of an energy sufficient diet: ratio of cost to food expenditures [CoCA_fexp] 0 Affordability of a healthy diet: ratio of cost to the food poverty line [CoHD_pov] 0 Affordability of a nutrient adequate diet: ratio of cost to the food poverty line [CoNA_pov] 0 Affordability of an energy sufficient diet: ratio of cost to the food poverty line [CoCA_pov] 0 Cost of oils and fats relative to the starchy staples in a least-cost healthy diet [CoHD_of_ss] 0 Cost of legumes, nuts and seeds relative to the starchy staples in a least-cost healthy diet [CoHD_lns_ss] 0 Cost of animal-sourced foods relative to the starchy staples in a least-cost healthy diet [CoHD_asf_ss] 0 Cost of vegetables relative to the starchy staples in a least-cost healthy diet [CoHD_v_ss] 0 Cost share for oils and fats in a least-cost healthy diet [CoHD_of_prop] 0 Cost of fruits relative to the starchy staples in a least-cost healthy diet [CoHD_f_ss] 0 Cost share for legumes, nuts and seeds in a least-cost healthy diet [CoHD_lns_prop] 0 Cost share for animal-sourced foods in a least-cost healthy diet [CoHD_asf_prop] 0 Cost share for starchy staples in a least-cost healthy diet [CoHD_ss_prop] 0 Cost share for vegetables in a least-cost healthy diet [CoHD_v_prop] 0 Cost share for fruits in a least-cost healthy 
diet [CoHD_f_prop] 0 Cost of oils and fats [CoHD_of] 0 Cost of legumes, nuts and seeds [CoHD_lns] 0 Cost of animal-source foods [CoHD_asf] 0 Cost of starchy staples [CoHD_ss] 0 Cost of vegetables [CoHD_v] 0 Cost of fruits [CoHD_f] 0 Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA] 0 Cost of a healthy diet [CoHD] 0 Cost of a nutrient adequate diet [CoNA] 0 Cost of an energy sufficient diet [CoCA] 0 dtype: int64
The next step in our data exploration is the identification of outliers.
An outlier, or extreme value, is an observation that differs markedly from the rest of the data. Given the nature of our dataset, where large values reflect genuine differences between countries rather than measurement errors, we assume that none of the variables contains true outliers; the boxplots below serve as a visual check.
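The visual check can be complemented by a numerical one. A common convention is Tukey's IQR rule, which flags values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]; a minimal sketch on toy data (the helper name `iqr_outliers` is ours, not part of the original analysis):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy example: 200 lies far outside the fences of the other values
s = pd.Series([1, 2, 3, 4, 5, 200])
flags = iqr_outliers(s)
```

Applied per float column of `df`, this would quantify what the boxplots show visually.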
df.describe()
| Population [Pop] | Millions of people who cannot afford a healthy diet [CoHD_unafford_n] | Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n] | Millions of people who cannot afford sufficient calories [CoCA_unafford_n] | Percent of the population who cannot afford a healthy diet [CoHD_headcount] | Percent of the population who cannot afford nutrient adequacy [CoNA_headcount] | Percent of the population who cannot afford sufficient calories [CoCA_headcount] | Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp] | Affordability of a nutrient adequate diet: ratio of cost to food expenditures [CoNA_fexp] | Affordability of an energy sufficient diet: ratio of cost to food expenditures [CoCA_fexp] | ... | Cost of oils and fats [CoHD_of] | Cost of legumes, nuts and seeds [CoHD_lns] | Cost of animal-source foods [CoHD_asf] | Cost of starchy staples [CoHD_ss] | Cost of vegetables [CoHD_v] | Cost of fruits [CoHD_f] | Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA] | Cost of a healthy diet [CoHD] | Cost of a nutrient adequate diet [CoNA] | Cost of an energy sufficient diet [CoCA] | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7.440000e+02 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | ... | 744.000000 | 744.000000 | 744.000000 | 744.00000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 |
| mean | 1.584458e+08 | 80.211613 | 58.572258 | 9.243226 | 37.424355 | 27.531774 | 6.933065 | 0.808629 | 0.591792 | 0.212583 | ... | 0.130354 | 0.346642 | 0.872105 | 0.50771 | 0.791588 | 0.660972 | 4.530318 | 3.311839 | 2.453239 | 0.833306 |
| std | 6.576380e+08 | 314.645590 | 231.846900 | 37.591397 | 31.556200 | 26.615104 | 12.287847 | 0.637104 | 0.439582 | 0.193355 | ... | 0.051585 | 0.144053 | 0.207621 | 0.18330 | 0.265897 | 0.239354 | 1.886365 | 0.629742 | 0.647819 | 0.355677 |
| min | 2.956700e+04 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.247000 | 0.167000 | 0.032000 | ... | 0.052000 | 0.069000 | 0.368000 | 0.14700 | 0.303000 | 0.162000 | 1.124543 | 1.822000 | 0.709000 | 0.239000 |
| 25% | 2.401840e+06 | 0.400000 | 0.200000 | 0.000000 | 4.000000 | 1.900000 | 0.200000 | 0.416750 | 0.313750 | 0.089500 | ... | 0.094750 | 0.257000 | 0.717500 | 0.40575 | 0.615250 | 0.518000 | 3.320000 | 2.871250 | 2.028250 | 0.608000 |
| 50% | 1.005770e+07 | 5.300000 | 3.950000 | 0.400000 | 37.424355 | 26.350000 | 2.200000 | 0.650000 | 0.461000 | 0.170000 | ... | 0.129500 | 0.333500 | 0.864000 | 0.49700 | 0.757000 | 0.640000 | 4.160000 | 3.273000 | 2.353000 | 0.801000 |
| 75% | 4.114406e+07 | 80.211613 | 58.572258 | 9.243226 | 65.725000 | 45.950000 | 6.933065 | 0.933500 | 0.726750 | 0.256250 | ... | 0.160000 | 0.407000 | 1.006000 | 0.58500 | 0.910750 | 0.786000 | 5.425000 | 3.619000 | 2.686000 | 1.009000 |
| max | 7.262507e+09 | 3133.400000 | 2293.800000 | 381.200000 | 97.500000 | 96.000000 | 78.500000 | 5.272000 | 3.913000 | 1.440000 | ... | 0.453000 | 0.875000 | 1.583000 | 1.47700 | 1.785000 | 1.997000 | 10.880000 | 5.975000 | 5.866000 | 2.958000 |
8 rows × 34 columns
# Boxplots for every float column, as a visual outlier check
float_columns = df.select_dtypes(include='float64').columns
for column in float_columns:
    plt.figure(figsize=(5, 3))
    sns.boxplot(x=df[column])
    plt.title(f'Boxplot for {column}')
    plt.show()
# Dictionary of correspondence between old and new column names
correspondance_noms = {
'Country Name': 'country',
'Classification Code': 'classif',
'Time' : 'year',
'Population [Pop]' : 'pop',
'Millions of people who cannot afford a healthy diet [CoHD_unafford_n]' : 'nb_notaff_healthy',
'Millions of people who cannot afford nutrient adequacy [CoNA_unafford_n]' : 'nb_notaff_nutrient',
'Millions of people who cannot afford sufficient calories [CoCA_unafford_n]' : 'nb_notaff_calories',
'Percent of the population who cannot afford a healthy diet [CoHD_headcount]' : 'per_notaff_healthy',
'Percent of the population who cannot afford nutrient adequacy [CoNA_headcount]' : 'per_notaff_nutrient',
'Percent of the population who cannot afford sufficient calories [CoCA_headcount]' : 'per_notaff_calories',
'Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp]' : 'aff_healthy_cost_fexp',
'Affordability of a nutrient adequate diet: ratio of cost to food expenditures [CoNA_fexp]' : 'aff_nutrient_cost_fexp',
'Affordability of an energy sufficient diet: ratio of cost to food expenditures [CoCA_fexp]' : 'aff_calories_cost_fexp',
'Affordability of a healthy diet: ratio of cost to the food poverty line [CoHD_pov]' : 'aff_healthy_cost_pov',
'Affordability of a nutrient adequate diet: ratio of cost to the food poverty line [CoNA_pov]' : 'aff_nutrient_cost_pov',
'Affordability of an energy sufficient diet: ratio of cost to the food poverty line [CoCA_pov]' : 'aff_calories_cost_pov',
'Cost of oils and fats relative to the starchy staples in a least-cost healthy diet [CoHD_of_ss]' : 'cost_of_ss',
'Cost of legumes, nuts and seeds relative to the starchy staples in a least-cost healthy diet [CoHD_lns_ss]' : 'cost_lns_ss',
'Cost of animal-sourced foods relative to the starchy staples in a least-cost healthy diet [CoHD_asf_ss]' : 'cost_asf_ss',
'Cost of vegetables relative to the starchy staples in a least-cost healthy diet [CoHD_v_ss]' : 'cost_v_ss',
'Cost share for oils and fats in a least-cost healthy diet [CoHD_of_prop]' : 'cost_of_healthy',
'Cost of fruits relative to the starchy staples in a least-cost healthy diet [CoHD_f_ss]' : 'cost_f_ss',
'Cost share for legumes, nuts and seeds in a least-cost healthy diet [CoHD_lns_prop]' : 'cost_lns_healthy',
'Cost share for animal-sourced foods in a least-cost healthy diet [CoHD_asf_prop]' : 'cost_asf_healthy',
'Cost share for starchy staples in a least-cost healthy diet [CoHD_ss_prop]' : 'cost_ss_healthy',
'Cost share for vegetables in a least-cost healthy diet [CoHD_v_prop]' : 'cost_v_healthy',
'Cost share for fruits in a least-cost healthy diet [CoHD_f_prop]' : 'cost_f_healthy',
'Cost of oils and fats [CoHD_of]' : 'cost_of',
'Cost of legumes, nuts and seeds [CoHD_lns]' : 'cost_lns',
'Cost of animal-source foods [CoHD_asf]' : 'cost_asf',
'Cost of starchy staples [CoHD_ss]' : 'cost_ss',
'Cost of vegetables [CoHD_v]' : 'cost_v',
'Cost of fruits [CoHD_f]' : 'cost_f',
'Cost of a healthy diet relative to the cost of sufficient energy from starchy staples [CoHD_CoCA]' : 'cost_healthy_ss',
'Cost of a healthy diet [CoHD]' : 'cost_healthy',
'Cost of a nutrient adequate diet [CoNA]' : 'cost_nutrient',
'Cost of an energy sufficient diet [CoCA]' : 'cost_energy'
}
df.rename(columns=correspondance_noms, inplace=True)
print(df)
classif country pop nb_notaff_healthy \
0 FPN 1.0 WORLD 7.208181e+09 3049.100000
1 FPN 1.0 Albania 2.873457e+06 1.100000
2 FPN 1.0 Algeria 4.138917e+07 14.600000
3 FPN 1.0 Angola 2.981677e+07 27.700000
4 FPN 1.0 Anguilla 1.584458e+08 80.211613
.. ... ... ... ...
739 FPN 2.1 Uruguay 3.422200e+06 0.100000
740 FPN 2.1 Vietnam 9.403305e+07 23.400000
741 FPN 2.1 West Bank and Gaza 4.454805e+06 0.800000
742 FPN 2.1 Zambia 1.729805e+07 15.300000
743 FPN 2.1 Zimbabwe 1.475110e+07 10.000000
nb_notaff_nutrient nb_notaff_calories per_notaff_healthy \
0 2212.200000 381.200000 42.900000
1 0.400000 0.000000 37.800000
2 3.000000 0.100000 35.200000
3 26.000000 17.000000 92.900000
4 58.572258 9.243226 37.424355
.. ... ... ...
739 0.000000 0.000000 2.800000
740 11.000000 0.700000 24.900000
741 0.100000 0.000000 18.000000
742 14.300000 11.200000 88.700000
743 1.600000 0.000000 67.800000
per_notaff_nutrient per_notaff_calories aff_healthy_cost_fexp ... \
0 31.100000 5.400000 0.808629 ...
1 13.000000 0.000000 0.425000 ...
2 7.200000 0.200000 0.605000 ...
3 87.100000 57.200000 0.972000 ...
4 27.531774 6.933065 0.577000 ...
.. ... ... ... ...
739 0.800000 0.000000 0.416000 ...
740 11.700000 0.700000 1.052000 ...
741 2.400000 0.600000 0.845000 ...
742 83.000000 64.900000 1.821000 ...
743 10.600000 0.000000 0.973000 ...
cost_of cost_lns cost_asf cost_ss cost_v cost_f cost_healthy_ss \
0 0.130 0.34800 0.873 0.508 0.793 0.662 4.530318
1 0.089 0.44100 1.204 0.599 0.707 0.911 5.450000
2 0.132 0.49300 0.964 0.496 0.699 0.979 4.870000
3 0.166 0.39500 1.011 0.839 1.199 0.717 3.080000
4 0.134 0.32800 0.647 0.778 1.019 0.811 2.840000
.. ... ... ... ... ... ... ...
739 0.068 0.48700 0.615 0.378 0.989 0.537 4.430000
740 0.189 0.30500 1.183 0.582 0.771 0.556 3.680000
741 0.122 0.18669 1.127 0.649 0.564 0.695 2.970000
742 0.130 0.23200 0.906 0.814 0.631 0.372 2.410000
743 0.056 0.12800 0.368 0.213 0.375 0.199 9.200000
cost_healthy cost_nutrient cost_energy
0 3.314000 2.457 0.834
1 3.952000 2.471 0.725
2 3.763000 2.323 0.773
3 4.327000 3.231 1.403
4 3.717000 2.433 1.308
.. ... ... ...
739 3.073000 2.129 0.694
740 3.586000 2.514 0.975
741 3.342000 1.761 1.124
742 3.085000 2.380 1.279
743 2.199685 0.709 0.239
[744 rows x 36 columns]
We will now compute descriptive statistics for a few relevant variables:
target_variable = 'per_notaff_healthy'
# Visualisation
sns.histplot(df[target_variable], kde=True)
plt.title('Distribution of ' + target_variable)
plt.show()
df.hist(bins=50, figsize=(20,15))
plt.show()
relevant_variables = ['per_notaff_healthy',
'aff_healthy_cost_fexp',
'cost_of_ss',
'cost_f_healthy',
'cost_asf_healthy']
# Univariate statistics for each variable
descriptives_stat = pd.DataFrame()
for variable in relevant_variables:
    descriptives_stat[variable] = df[variable].describe()
print(descriptives_stat)
per_notaff_healthy aff_healthy_cost_fexp cost_of_ss cost_f_healthy \
count 744.000000 744.000000 744.000000 744.000000
mean 37.424355 0.808629 0.271229 0.199217
std 31.556200 0.637104 0.101297 0.053770
min 0.000000 0.247000 0.060000 0.065000
25% 4.000000 0.416750 0.200000 0.163750
50% 37.424355 0.650000 0.260000 0.199217
75% 65.725000 0.933500 0.322500 0.235000
max 97.500000 5.272000 0.790000 0.423000
cost_asf_healthy
count 744.000000
mean 0.265811
std 0.053360
min 0.123000
25% 0.231000
50% 0.265811
75% 0.299250
max 0.413000
The mean percentage of the population unable to afford a healthy diet is approximately 37.42%, with a notable standard deviation of 31.56%, suggesting substantial variability in affordability across nations. The mean ratio of the cost of a healthy diet to total food expenditure is approximately 0.81: on average, a healthy diet costs less than what households already spend on food, which suggests it is affordable in most countries.
The variability in the cost components is evident in the statistics. The mean cost share of oils and fats relative to starchy staples (cost_of_ss) is approximately 0.27, with a standard deviation of 0.10. Similarly, the mean cost share of fruits relative to starchy staples (cost_f_healthy) is approximately 0.20, showing some variation in the dietary composition cost across countries. The mean cost share for animal-sourced foods in a healthy diet (cost_asf_healthy) is approximately 0.27, reflecting the proportional expenditure on animal-sourced foods. It is noteworthy that the standard deviations for these cost components are relatively moderate, indicating a degree of consistency in the distribution of these cost shares across countries.
The minimum and maximum values for each variable provide additional context, highlighting the range of variation observed in the dataset, which may be due to differences in population size and socioeconomic conditions (economic disruption, wars/conflicts, displacement/migration, international aid, etc.). Countries with larger populations may exhibit greater diversity in income levels and living standards, leading to a broader range of affordability percentages. Smaller populations, on the other hand, may experience more homogeneity in these indicators.
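One way to compare the variability of variables on very different scales is the coefficient of variation (std / mean), which is unitless. A small sketch using the summary figures from `descriptives_stat` above:

```python
import pandas as pd

# Means and standard deviations taken from descriptives_stat above
stats = pd.DataFrame(
    {"mean": [37.424355, 0.808629, 0.271229],
     "std": [31.556200, 0.637104, 0.101297]},
    index=["per_notaff_healthy", "aff_healthy_cost_fexp", "cost_of_ss"],
)
# Coefficient of variation: relative spread, comparable across scales
stats["cv"] = stats["std"] / stats["mean"]
```

The target variable has the largest relative spread (CV ≈ 0.84), consistent with the wide cross-country differences noted above.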
# Visualisation for each variable
plt.figure(figsize=(10, 20))
for i, variable in enumerate(relevant_variables, 1):
    plt.subplot(len(relevant_variables), 1, i)
    sns.histplot(df[variable], bins=20, kde=True, color='skyblue', edgecolor='black')
    plt.title('Distribution of ' + variable)
    plt.xlabel(variable)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
The variable aff_healthy_cost_fexp measures the relationship between the cost of a healthy diet and food expenditure. More precisely, it quantifies the population's financial capacity to afford a nutritionally healthy diet by expressing the cost of such a diet as a ratio of total food expenditure.
The histogram shows that most values of this variable exceed 0.5, indicating that a healthy diet absorbs more than half of total food spending in most countries, which can signal financial barriers to accessing healthy food.
The variable cost_of_ss (Cost of oils and fats relative to the starchy staples in a least-cost healthy diet) measures the cost of oils and fats relative to starchy staples in a least-cost healthy diet. It indicates how the cost of oils and fats compares with that of starchy staples in a diet considered both healthy and economical.
The histogram shows that the values of cost_of_ss cluster well below 1 (around 0.27 on average), suggesting that oils and fats are considerably cheaper than starchy staples in a least-cost healthy diet.
The variable cost_f_healthy (Cost share for fruits in a least-cost healthy diet) measures the share of total cost attributed to fruits in a least-cost healthy diet.
Most observed values lie around 0.2, meaning that fruits account for roughly one fifth of the total cost of a least-cost healthy diet.
The variable cost_asf_healthy (Cost share for animal-sourced foods in a least-cost healthy diet) measures the share of total cost attributed to animal-source foods in a least-cost healthy diet, that is, the proportion of the overall cost of a healthy, economical diet that goes to animal products.
The values of cost_asf_healthy centre around 0.27, suggesting that animal-source foods account for roughly a quarter of the total cost of a least-cost healthy diet, the largest single cost share among the food groups examined.
sns.pairplot(data=df, vars=relevant_variables, hue='pop')
plt.show()
We will now perform a bivariate analysis between the target variable and a few variables that are relevant to study.
# Country
We want to visualize the percentage of the population that cannot afford a healthy diet by country. Given that the dataset contains many countries, we will show the 3 countries with the highest percentage of people who cannot afford a healthy diet alongside the 3 countries with the lowest percentage.
df_sort = df.sort_values(by='per_notaff_healthy', ascending=False)
df_sort = df_sort.dropna()
# Selection of the 3 countries with the highest percentages
top3_high = df_sort.head(3)
# Selection of the 3 countries with the lowest percentages
top3_low = df_sort.tail(3)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))
# ax1
ax1.bar(top3_high['country'], top3_high['per_notaff_healthy'])
ax1.set_title('Top 3 countries with the highest percentages')
# ax2
ax2.bar(top3_low['country'], top3_low['per_notaff_healthy'])
ax2.set_title('Top 3 countries with the lowest percentages')
plt.tight_layout()
plt.show()
The two graphs above show that Burundi has the highest percentage of people who cannot afford a healthy diet (99%), followed by Madagascar (98%) and Liberia (97%). At the other extreme, in Switzerland, Iceland and the United Arab Emirates, virtually the entire population can afford a healthy diet.
# Affordability of a healthy diet: ratio of cost to food expenditures [CoHD_fexp]
plt.figure(figsize=(8, 6))
sns.scatterplot(x='aff_healthy_cost_fexp', y=target_variable, data=df)
plt.title('Relationship between aff_healthy_cost_fexp and ' + target_variable)
plt.xlabel('aff_healthy_cost_fexp')
plt.ylabel('per_notaff_healthy')
plt.show()
Interpretation: In a relatively large number of countries, the cost of a healthy diet is low relative to total food expenditure. For a considerable subset of countries, however, a healthy diet costs 1 to 2.5 times total food expenditure; in these cases, the share of individuals unable to afford a healthy diet generally exceeds 80%.
# Cost of oils and fats relative to the starchy staples in a least-cost healthy diet [CoHD_of_ss]
plt.figure(figsize=(8, 6))
sns.scatterplot(x='cost_of_ss', y=target_variable, data=df)
plt.title('Relationship between cost_of_ss and ' + target_variable)
plt.xlabel('cost_of_ss')
plt.ylabel('per_notaff_healthy')
plt.show()
# Cost of fruits relative to the starchy staples in a least-cost healthy diet [CoHD_f_ss]
plt.figure(figsize=(8, 6))
sns.scatterplot(x='cost_f_ss', y=target_variable, data=df)
plt.title('Relationship between cost_f_ss and ' + target_variable)
plt.xlabel('cost_f_ss')
plt.ylabel('per_notaff_healthy')
plt.show()
# Cost share for animal-sourced foods in a least-cost healthy diet [CoHD_asf_prop]
plt.figure(figsize=(8, 6))
sns.scatterplot(x='cost_asf_healthy', y=target_variable, data=df)
plt.title('Relationship between cost_asf_healthy and ' + target_variable)
plt.xlabel('cost_asf_healthy')
plt.ylabel('per_notaff_healthy')
plt.show()
The data in these last 3 scatter plots are widely dispersed, which suggests substantial variability in the relationship between the variables examined.
This generally indicates that the relationships are not easily reduced to a simple linear trend. It may be helpful to explore the underlying factors further and to consider other analyses to better understand the complexity of the relationships between the variables involved.
In this part, we will examine the relationships among the explanatory variables, that is, all variables other than the target, as well as between the explanatory variables and the variable to be explained. This helps identify potential relationships between variables, avoid multicollinearity (high correlation between predictors), and ensure that the assumptions underlying statistical or machine learning models are respected. Such verification also helps select the most relevant variables for the model, thereby improving both the quality of predictions and the understanding of relationships between features of the dataset.
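Beyond reading the heatmap by eye, highly correlated predictors can be shortlisted programmatically. A sketch on synthetic data, assuming a 0.9 cutoff (the cutoff and the variable names are ours; the notebook's actual column selection below was made from the heatmap):

```python
import numpy as np
import pandas as pd

# Synthetic frame: x1 and x2 are nearly collinear, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.01, size=200),  # almost a copy of x1
    "x3": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Columns involved in at least one correlation above the cutoff
drop_candidates = [c for c in upper.columns if (upper[c] > 0.9).any()]
```

Here `drop_candidates` contains only `x2`, the near-duplicate of `x1`; the same pattern applied to `df` would motivate the column drops performed below.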
df.describe()
| pop | nb_notaff_healthy | nb_notaff_nutrient | nb_notaff_calories | per_notaff_healthy | per_notaff_nutrient | per_notaff_calories | aff_healthy_cost_fexp | aff_nutrient_cost_fexp | aff_calories_cost_fexp | ... | cost_of | cost_lns | cost_asf | cost_ss | cost_v | cost_f | cost_healthy_ss | cost_healthy | cost_nutrient | cost_energy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7.440000e+02 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | ... | 744.000000 | 744.000000 | 744.000000 | 744.00000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 |
| mean | 1.584458e+08 | 80.211613 | 58.572258 | 9.243226 | 37.424355 | 27.531774 | 6.933065 | 0.808629 | 0.591792 | 0.212583 | ... | 0.130354 | 0.346642 | 0.872105 | 0.50771 | 0.791588 | 0.660972 | 4.530318 | 3.311839 | 2.453239 | 0.833306 |
| std | 6.576380e+08 | 314.645590 | 231.846900 | 37.591397 | 31.556200 | 26.615104 | 12.287847 | 0.637104 | 0.439582 | 0.193355 | ... | 0.051585 | 0.144053 | 0.207621 | 0.18330 | 0.265897 | 0.239354 | 1.886365 | 0.629742 | 0.647819 | 0.355677 |
| min | 2.956700e+04 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.247000 | 0.167000 | 0.032000 | ... | 0.052000 | 0.069000 | 0.368000 | 0.14700 | 0.303000 | 0.162000 | 1.124543 | 1.822000 | 0.709000 | 0.239000 |
| 25% | 2.401840e+06 | 0.400000 | 0.200000 | 0.000000 | 4.000000 | 1.900000 | 0.200000 | 0.416750 | 0.313750 | 0.089500 | ... | 0.094750 | 0.257000 | 0.717500 | 0.40575 | 0.615250 | 0.518000 | 3.320000 | 2.871250 | 2.028250 | 0.608000 |
| 50% | 1.005770e+07 | 5.300000 | 3.950000 | 0.400000 | 37.424355 | 26.350000 | 2.200000 | 0.650000 | 0.461000 | 0.170000 | ... | 0.129500 | 0.333500 | 0.864000 | 0.49700 | 0.757000 | 0.640000 | 4.160000 | 3.273000 | 2.353000 | 0.801000 |
| 75% | 4.114406e+07 | 80.211613 | 58.572258 | 9.243226 | 65.725000 | 45.950000 | 6.933065 | 0.933500 | 0.726750 | 0.256250 | ... | 0.160000 | 0.407000 | 1.006000 | 0.58500 | 0.910750 | 0.786000 | 5.425000 | 3.619000 | 2.686000 | 1.009000 |
| max | 7.262507e+09 | 3133.400000 | 2293.800000 | 381.200000 | 97.500000 | 96.000000 | 78.500000 | 5.272000 | 3.913000 | 1.440000 | ... | 0.453000 | 0.875000 | 1.583000 | 1.47700 | 1.785000 | 1.997000 | 10.880000 | 5.975000 | 5.866000 | 2.958000 |
8 rows × 34 columns
df.dtypes
classif object country object pop float64 nb_notaff_healthy float64 nb_notaff_nutrient float64 nb_notaff_calories float64 per_notaff_healthy float64 per_notaff_nutrient float64 per_notaff_calories float64 aff_healthy_cost_fexp float64 aff_nutrient_cost_fexp float64 aff_calories_cost_fexp float64 aff_healthy_cost_pov float64 aff_nutrient_cost_pov float64 aff_calories_cost_pov float64 cost_of_ss float64 cost_lns_ss float64 cost_asf_ss float64 cost_v_ss float64 cost_of_healthy float64 cost_f_ss float64 cost_lns_healthy float64 cost_asf_healthy float64 cost_ss_healthy float64 cost_v_healthy float64 cost_f_healthy float64 cost_of float64 cost_lns float64 cost_asf float64 cost_ss float64 cost_v float64 cost_f float64 cost_healthy_ss float64 cost_healthy float64 cost_nutrient float64 cost_energy float64 dtype: object
# Separation of numerical and categorical variables
cat_col = df.dtypes[df.dtypes == 'object']
num_col = df.dtypes[df.dtypes != 'object']
fig = plt.figure(figsize=(14, 8))
cmap = 'viridis'
heatmap = sns.heatmap(df[list(num_col.index)].corr(), annot=True, square=True, cmap=cmap,
annot_kws={"size":5})
annot_font_size = 6
heatmap.set_xticklabels(heatmap.get_xticklabels(), fontsize=annot_font_size)
heatmap.set_yticklabels(heatmap.get_yticklabels(), fontsize=annot_font_size)
plt.show()
print(f"Number of columns before deleting: {df.shape[1]}")
del_cols = ['nb_notaff_nutrient', 'nb_notaff_healthy',
'nb_notaff_calories','aff_healthy_cost_pov',
'aff_nutrient_cost_pov','aff_calories_cost_pov','cost_of_ss', 'cost_lns_ss',
'cost_asf_ss','cost_v_ss',
'cost_f_ss','cost_healthy_ss',
'cost_of',
'cost_lns','cost_asf',
'cost_ss','cost_v','cost_f','aff_healthy_cost_fexp',
'aff_nutrient_cost_fexp',
'aff_calories_cost_fexp',
'per_notaff_nutrient',
'cost_energy']
df.drop(labels = del_cols,axis = 1,inplace = True)
print(f"Number of columns after deleting: {df.shape[1]}")
Number of columns before deleting: 36 Number of columns after deleting: 13
# Separation of numerical and categorical variables
cat_col2 = df.dtypes[df.dtypes == 'object']
num_col2 = df.dtypes[df.dtypes != 'object']
fig = plt.figure(figsize=(14, 8))
cmap = 'viridis'
heatmap = sns.heatmap(df[list(num_col2.index)].corr(), annot=True, square=True, cmap=cmap,
annot_kws={"size":9})
annot_font_size = 10
heatmap.set_xticklabels(heatmap.get_xticklabels(), fontsize=annot_font_size)
heatmap.set_yticklabels(heatmap.get_yticklabels(), fontsize=annot_font_size)
plt.show()
df.to_csv('dfclean.csv', index = False)
df = pd.read_csv('dfclean.csv')
df.head()
| classif | country | pop | per_notaff_healthy | per_notaff_calories | cost_of_healthy | cost_lns_healthy | cost_asf_healthy | cost_ss_healthy | cost_v_healthy | cost_f_healthy | cost_healthy | cost_nutrient | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FPN 1.0 | WORLD | 7.208181e+09 | 42.900000 | 5.400000 | 0.03962 | 0.104686 | 0.265811 | 0.152137 | 0.237366 | 0.199217 | 3.314 | 2.457 |
| 1 | FPN 1.0 | Albania | 2.873457e+06 | 37.800000 | 0.000000 | 0.02300 | 0.112000 | 0.305000 | 0.152000 | 0.179000 | 0.231000 | 3.952 | 2.471 |
| 2 | FPN 1.0 | Algeria | 4.138917e+07 | 35.200000 | 0.200000 | 0.03500 | 0.131000 | 0.256000 | 0.132000 | 0.186000 | 0.260000 | 3.763 | 2.323 |
| 3 | FPN 1.0 | Angola | 2.981677e+07 | 92.900000 | 57.200000 | 0.03800 | 0.091000 | 0.234000 | 0.194000 | 0.277000 | 0.166000 | 4.327 | 3.231 |
| 4 | FPN 1.0 | Anguilla | 1.584458e+08 | 37.424355 | 6.933065 | 0.03600 | 0.088000 | 0.174000 | 0.209000 | 0.274000 | 0.218000 | 3.717 | 2.433 |
df.describe()
| pop | per_notaff_healthy | per_notaff_calories | cost_of_healthy | cost_lns_healthy | cost_asf_healthy | cost_ss_healthy | cost_v_healthy | cost_f_healthy | cost_healthy | cost_nutrient | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7.440000e+02 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 | 744.000000 |
| mean | 1.584458e+08 | 37.424355 | 6.933065 | 0.039620 | 0.104686 | 0.265811 | 0.152137 | 0.237366 | 0.199217 | 3.311839 | 2.453239 |
| std | 6.576380e+08 | 31.556200 | 12.287847 | 0.014279 | 0.038443 | 0.053360 | 0.041622 | 0.058733 | 0.053770 | 0.629742 | 0.647819 |
| min | 2.956700e+04 | 0.000000 | 0.000000 | 0.015000 | 0.022000 | 0.123000 | 0.067000 | 0.108000 | 0.065000 | 1.822000 | 0.709000 |
| 25% | 2.401840e+06 | 4.000000 | 0.200000 | 0.029000 | 0.083750 | 0.231000 | 0.123000 | 0.202000 | 0.163750 | 2.871250 | 2.028250 |
| 50% | 1.005770e+07 | 37.424355 | 2.200000 | 0.039000 | 0.104000 | 0.265811 | 0.150000 | 0.237000 | 0.199217 | 3.273000 | 2.353000 |
| 75% | 4.114406e+07 | 65.725000 | 6.933065 | 0.046000 | 0.119000 | 0.299250 | 0.177000 | 0.274000 | 0.235000 | 3.619000 | 2.686000 |
| max | 7.262507e+09 | 97.500000 | 78.500000 | 0.106000 | 0.264000 | 0.413000 | 0.316000 | 0.487000 | 0.423000 | 5.975000 | 5.866000 |
Since a significant portion of our variables are continuous, and the upcoming phase of our study requires their conversion into categorical variables, a thoughtful transformation process is necessary. It is therefore crucial to identify a suitable approach for this transformation.
To begin, we compute the median and then the mean of each continuous variable, as candidate cut-off points.
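As stated in the introduction, the target itself will be binarised at its mean of roughly 37.42%. A minimal sketch of that step on a few illustrative values (the column name `target_bin` is ours):

```python
import pandas as pd

# Sample target values for illustration (taken from the printed df above)
toy = pd.DataFrame({"per_notaff_healthy": [2.8, 37.424355, 42.9, 92.9]})

threshold = 37.424355  # mean of the target in the full dataset
# 0 = at or below the 37% threshold, 1 = above it
toy["target_bin"] = (toy["per_notaff_healthy"] > threshold).astype(int)
```

The same one-liner applied to `df[target_variable]` yields the binary outcome for the classification model described in the introduction.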
# Calculation of medians
median_pop = df['pop'].median()
print("Median Population:", median_pop)
median_per_notaff_healthy = df['per_notaff_healthy'].median()
print("Median % of the population who cannot afford a healthy diet:", median_per_notaff_healthy)
median_per_notaff_calories = df['per_notaff_calories'].median()
print("Median % of the population who cannot afford sufficient calories:", median_per_notaff_calories)
median_cost_of_healthy = df['cost_of_healthy'].median()
print("Median cost share for oils and fats in a least-cost healthy diet:", median_cost_of_healthy)
median_cost_lns_healthy = df['cost_lns_healthy'].median()
print("Median cost share for legumes, nuts and seeds in a least-cost healthy diet:", median_cost_lns_healthy)
median_cost_asf_healthy = df['cost_asf_healthy'].median()
print("Median cost share for animal-sourced foods in a least-cost healthy diet:", median_cost_asf_healthy)
median_cost_ss_healthy = df['cost_ss_healthy'].median()
print("Median cost share for starchy staples in a least-cost healthy diet:", median_cost_ss_healthy)
median_cost_v_healthy = df['cost_v_healthy'].median()
print("Median cost share for vegetables in a least-cost healthy diet:", median_cost_v_healthy)
median_cost_f_healthy = df['cost_f_healthy'].median()
print("Median cost share for fruits in a least-cost healthy diet:", median_cost_f_healthy)
median_cost_healthy = df['cost_healthy'].median()
print("Median cost of a healthy diet:", median_cost_healthy)
median_cost_nutrient = df['cost_nutrient'].median()
print("Median cost of a nutrient adequate diet:", median_cost_nutrient)
Median Population: 10057698.0
Median % of the population who cannot afford a healthy diet: 37.42435483870968
Median % of the population who cannot afford sufficient calories: 2.2
Median cost share for oils and fats in a least-cost healthy diet: 0.039
Median cost share for legumes, nuts and seeds in a least-cost healthy diet: 0.104
Median cost share for animal-sourced foods in a least-cost healthy diet: 0.2658114285714285
Median cost share for starchy staples in a least-cost healthy diet: 0.15
Median cost share for vegetables in a least-cost healthy diet: 0.237
Median cost share for fruits in a least-cost healthy diet: 0.1992171428571428
Median cost of a healthy diet: 3.273
Median cost of a nutrient adequate diet: 2.353
# Calculation of means
mean_pop = df['pop'].mean()
print("Mean Population:", mean_pop)
mean_per_notaff_healthy = df['per_notaff_healthy'].mean()
print("Mean % of the population who cannot afford a healthy diet:", mean_per_notaff_healthy)
mean_per_notaff_calories = df['per_notaff_calories'].mean()
print("Mean % of the population who cannot afford sufficient calories:", mean_per_notaff_calories)
mean_cost_of_healthy = df['cost_of_healthy'].mean()
print("Mean cost share for oils and fats in a least-cost healthy diet:", mean_cost_of_healthy)
mean_cost_lns_healthy = df['cost_lns_healthy'].mean()
print("Mean cost share for legumes, nuts and seeds in a least-cost healthy diet:", mean_cost_lns_healthy)
mean_cost_asf_healthy = df['cost_asf_healthy'].mean()
print("Mean cost share for animal-sourced foods in a least-cost healthy diet:", mean_cost_asf_healthy)
mean_cost_ss_healthy = df['cost_ss_healthy'].mean()
print("Mean cost share for starchy staples in a least-cost healthy diet:", mean_cost_ss_healthy)
mean_cost_v_healthy = df['cost_v_healthy'].mean()
print("Mean cost share for vegetables in a least-cost healthy diet:", mean_cost_v_healthy)
mean_cost_f_healthy = df['cost_f_healthy'].mean()
print("Mean cost share for fruits in a least-cost healthy diet:", mean_cost_f_healthy)
mean_cost_healthy = df['cost_healthy'].mean()
print("Mean cost of a healthy diet:", mean_cost_healthy)
mean_cost_nutrient = df['cost_nutrient'].mean()
print("Mean cost of a nutrient adequate diet:", mean_cost_nutrient)
Mean Population: 158445773.61202186
Mean % of the population who cannot afford a healthy diet: 37.42435483870968
Mean % of the population who cannot afford sufficient calories: 6.933064516129034
Mean cost share for oils and fats in a least-cost healthy diet: 0.03962
Mean cost share for legumes, nuts and seeds in a least-cost healthy diet: 0.10468571428571427
Mean cost share for animal-sourced foods in a least-cost healthy diet: 0.26581142857142853
Mean cost share for starchy staples in a least-cost healthy diet: 0.15213714285714286
Mean cost share for vegetables in a least-cost healthy diet: 0.2373657142857143
Mean cost share for fruits in a least-cost healthy diet: 0.19921714285714287
Mean cost of a healthy diet: 3.3118389088397793
Mean cost of a nutrient adequate diet: 2.453239247311828
| Continuous Feature | Median | Mean |
|---|---|---|
| pop | 10057698.0 | 158445773.61 |
| per_notaff_healthy | 37.42 | 37.42 |
| per_notaff_calories | 2.2 | 6.93 |
| cost_of_healthy | 0.0390 | 0.0396 |
| cost_lns_healthy | 0.1040 | 0.1046 |
| cost_asf_healthy | 0.2658 | 0.2658 |
| cost_ss_healthy | 0.1500 | 0.1521 |
| cost_v_healthy | 0.2370 | 0.2373 |
| cost_f_healthy | 0.1992 | 0.1992 |
| cost_healthy | 3.27 | 3.31 |
| cost_nutrient | 2.35 | 2.45 |
The table above summarizes the median and the mean of each continuous variable in our data set.
In the next stage of the project, the continuous variables are discretized into categorical counterparts, using the mean of each variable as the threshold. This choice is deliberate: thresholding at the median would always produce two well-balanced classes (50%/50%) for every variable, and we preferred to avoid that uniformity and allow more variability across the categories.
For the 'pop' (population) variable we instead use the median as the threshold. Population is heavily right-skewed (a mean of roughly 158 million versus a median of about 10 million), so the median is the more representative central value in the presence of such outliers.
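A toy illustration (the numbers here are made up, not from our data) of why a mean threshold can yield unbalanced classes while a median threshold always splits the observations roughly in half:

```python
import pandas as pd

# Made-up skewed values: one large outlier pulls the mean well above the median
s = pd.Series([1, 2, 3, 4, 100])
threshold = s.mean()  # 22.0, versus a median of 3.0

cats = pd.cut(s, bins=[float('-inf'), threshold, float('inf')],
              labels=['Less than or equal to mean', 'More than mean'])
print(cats.value_counts().to_dict())
# {'Less than or equal to mean': 4, 'More than mean': 1}
```

This is why the mean-based cuts give uneven class frequencies (for example 476 versus 268 for per_notaff_healthy in the df_cat.describe() output), whereas a median threshold would always split the 744 observations roughly 372/372.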
# Creating a new categorical feature based on the mean
df_cat = pd.DataFrame()
df_cat['country'] = df['country']
df_cat['classif'] = df['classif']
df_cat['pop_cat'] = pd.cut(df['pop'], bins=[float('-inf'), median_pop, float('inf')], labels=['Less than or equal to 10057698', 'More than 10057698'])
df_cat['per_notaff_healthy'] = pd.cut(df['per_notaff_healthy'], bins=[float('-inf'), mean_per_notaff_healthy, float('inf')], labels=['Less than or equal to 37.42%', 'More than 37.42%'])
df_cat['per_notaff_calories'] = pd.cut(df['per_notaff_calories'], bins=[float('-inf'), mean_per_notaff_calories, float('inf')], labels=['Less than or equal to 6.93%', 'More than 6.93%'])
df_cat['cost_of_healthy'] = pd.cut(df['cost_of_healthy'], bins=[float('-inf'), mean_cost_of_healthy, float('inf')], labels=['Less than or equal to 0.0396', 'More than 0.0396'])
df_cat['cost_lns_healthy'] = pd.cut(df['cost_lns_healthy'], bins=[float('-inf'), mean_cost_lns_healthy, float('inf')], labels=['Less than or equal to 0.1046', 'More than 0.1046'])
df_cat['cost_asf_healthy'] = pd.cut(df['cost_asf_healthy'], bins=[float('-inf'), mean_cost_asf_healthy, float('inf')], labels=['Less than or equal to 0.2658', 'More than 0.2658'])
df_cat['cost_ss_healthy'] = pd.cut(df['cost_ss_healthy'], bins=[float('-inf'), mean_cost_ss_healthy, float('inf')], labels=['Less than or equal to 0.1521', 'More than 0.1521'])
df_cat['cost_v_healthy'] = pd.cut(df['cost_v_healthy'], bins=[float('-inf'), mean_cost_v_healthy, float('inf')], labels=['Less than or equal to 0.2373', 'More than 0.2373'])
df_cat['cost_f_healthy'] = pd.cut(df['cost_f_healthy'], bins=[float('-inf'), mean_cost_f_healthy, float('inf')], labels=['Less than or equal to 0.1992', 'More than 0.1992'])
df_cat['cost_healthy'] = pd.cut(df['cost_healthy'], bins=[float('-inf'), mean_cost_healthy, float('inf')], labels=['Less than or equal to 3.31', 'More than 3.31'])
df_cat['cost_nutrient'] = pd.cut(df['cost_nutrient'], bins=[float('-inf'), mean_cost_nutrient, float('inf')], labels=['Less than or equal to 2.45', 'More than 2.45'])
df_cat.head(10)
| country | classif | pop_cat | per_notaff_healthy | per_notaff_calories | cost_of_healthy | cost_lns_healthy | cost_asf_healthy | cost_ss_healthy | cost_v_healthy | cost_f_healthy | cost_healthy | cost_nutrient | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WORLD | FPN 1.0 | More than 10057698 | More than 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | Less than or equal to 0.1046 | Less than or equal to 0.2658 | Less than or equal to 0.1521 | Less than or equal to 0.2373 | Less than or equal to 0.1992 | More than 3.31 | More than 2.45 |
| 1 | Albania | FPN 1.0 | Less than or equal to 10057698 | More than 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | More than 0.1046 | More than 0.2658 | Less than or equal to 0.1521 | Less than or equal to 0.2373 | More than 0.1992 | More than 3.31 | More than 2.45 |
| 2 | Algeria | FPN 1.0 | More than 10057698 | Less than or equal to 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | More than 0.1046 | Less than or equal to 0.2658 | Less than or equal to 0.1521 | Less than or equal to 0.2373 | More than 0.1992 | More than 3.31 | Less than or equal to 2.45 |
| 3 | Angola | FPN 1.0 | More than 10057698 | More than 37.42% | More than 6.93% | Less than or equal to 0.0396 | Less than or equal to 0.1046 | Less than or equal to 0.2658 | More than 0.1521 | More than 0.2373 | Less than or equal to 0.1992 | More than 3.31 | More than 2.45 |
| 4 | Anguilla | FPN 1.0 | More than 10057698 | Less than or equal to 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | Less than or equal to 0.1046 | Less than or equal to 0.2658 | More than 0.1521 | More than 0.2373 | More than 0.1992 | More than 3.31 | Less than or equal to 2.45 |
| 5 | Antigua and Barbuda | FPN 1.0 | Less than or equal to 10057698 | Less than or equal to 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | Less than or equal to 0.1046 | Less than or equal to 0.2658 | Less than or equal to 0.1521 | More than 0.2373 | More than 0.1992 | More than 3.31 | More than 2.45 |
| 6 | Argentina | FPN 1.0 | More than 10057698 | Less than or equal to 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | More than 0.1046 | Less than or equal to 0.2658 | Less than or equal to 0.1521 | Less than or equal to 0.2373 | More than 0.1992 | More than 3.31 | More than 2.45 |
| 7 | Armenia | FPN 1.0 | Less than or equal to 10057698 | More than 37.42% | Less than or equal to 6.93% | More than 0.0396 | More than 0.1046 | More than 0.2658 | More than 0.1521 | Less than or equal to 0.2373 | Less than or equal to 0.1992 | Less than or equal to 3.31 | Less than or equal to 2.45 |
| 8 | Aruba | FPN 1.0 | Less than or equal to 10057698 | Less than or equal to 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | More than 0.1046 | Less than or equal to 0.2658 | More than 0.1521 | More than 0.2373 | Less than or equal to 0.1992 | More than 3.31 | More than 2.45 |
| 9 | Australia | FPN 1.0 | More than 10057698 | Less than or equal to 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | More than 0.1046 | Less than or equal to 0.2658 | Less than or equal to 0.1521 | More than 0.2373 | More than 0.1992 | Less than or equal to 3.31 | Less than or equal to 2.45 |
df_cat.describe()
| country | classif | pop_cat | per_notaff_healthy | per_notaff_calories | cost_of_healthy | cost_lns_healthy | cost_asf_healthy | cost_ss_healthy | cost_v_healthy | cost_f_healthy | cost_healthy | cost_nutrient | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 744 | 744 | 744 | 744 | 744 | 744 | 744 | 744 | 744 | 744 | 744 | 744 | 744 |
| unique | 186 | 4 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| top | WORLD | FPN 1.0 | Less than or equal to 10057698 | Less than or equal to 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | Less than or equal to 0.1046 | Less than or equal to 0.2658 | Less than or equal to 0.1521 | Less than or equal to 0.2373 | Less than or equal to 0.1992 | Less than or equal to 3.31 | Less than or equal to 2.45 |
| freq | 4 | 186 | 373 | 476 | 589 | 418 | 434 | 408 | 434 | 424 | 394 | 384 | 418 |
df_cat.to_csv('df_cat.csv', index = False)
df = pd.read_csv('df_cat.csv')
df.head()
| country | classif | pop_cat | per_notaff_healthy | per_notaff_calories | cost_of_healthy | cost_lns_healthy | cost_asf_healthy | cost_ss_healthy | cost_v_healthy | cost_f_healthy | cost_healthy | cost_nutrient | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WORLD | FPN 1.0 | More than 10057698 | More than 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | Less than or equal to 0.1046 | Less than or equal to 0.2658 | Less than or equal to 0.1521 | Less than or equal to 0.2373 | Less than or equal to 0.1992 | More than 3.31 | More than 2.45 |
| 1 | Albania | FPN 1.0 | Less than or equal to 10057698 | More than 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | More than 0.1046 | More than 0.2658 | Less than or equal to 0.1521 | Less than or equal to 0.2373 | More than 0.1992 | More than 3.31 | More than 2.45 |
| 2 | Algeria | FPN 1.0 | More than 10057698 | Less than or equal to 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | More than 0.1046 | Less than or equal to 0.2658 | Less than or equal to 0.1521 | Less than or equal to 0.2373 | More than 0.1992 | More than 3.31 | Less than or equal to 2.45 |
| 3 | Angola | FPN 1.0 | More than 10057698 | More than 37.42% | More than 6.93% | Less than or equal to 0.0396 | Less than or equal to 0.1046 | Less than or equal to 0.2658 | More than 0.1521 | More than 0.2373 | Less than or equal to 0.1992 | More than 3.31 | More than 2.45 |
| 4 | Anguilla | FPN 1.0 | More than 10057698 | Less than or equal to 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | Less than or equal to 0.1046 | Less than or equal to 0.2658 | More than 0.1521 | More than 0.2373 | More than 0.1992 | More than 3.31 | Less than or equal to 2.45 |
df.describe()
| country | classif | pop_cat | per_notaff_healthy | per_notaff_calories | cost_of_healthy | cost_lns_healthy | cost_asf_healthy | cost_ss_healthy | cost_v_healthy | cost_f_healthy | cost_healthy | cost_nutrient | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 744 | 744 | 744 | 744 | 744 | 744 | 744 | 744 | 744 | 744 | 744 | 744 | 744 |
| unique | 186 | 4 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| top | WORLD | FPN 1.0 | Less than or equal to 10057698 | Less than or equal to 37.42% | Less than or equal to 6.93% | Less than or equal to 0.0396 | Less than or equal to 0.1046 | Less than or equal to 0.2658 | Less than or equal to 0.1521 | Less than or equal to 0.2373 | Less than or equal to 0.1992 | Less than or equal to 3.31 | Less than or equal to 2.45 |
| freq | 4 | 186 | 373 | 476 | 589 | 418 | 434 | 408 | 434 | 424 | 394 | 384 | 418 |
# Target variable
target = df['per_notaff_healthy']
df.drop(columns = ['per_notaff_healthy'], inplace=True)
# Independent variables
predictors = pd.get_dummies(df)
predictors.astype(int).head()
| country_Albania | country_Algeria | country_Angola | country_Anguilla | country_Antigua and Barbuda | country_Argentina | country_Armenia | country_Aruba | country_Australia | country_Austria | ... | cost_ss_healthy_Less than or equal to 0.1521 | cost_ss_healthy_More than 0.1521 | cost_v_healthy_Less than or equal to 0.2373 | cost_v_healthy_More than 0.2373 | cost_f_healthy_Less than or equal to 0.1992 | cost_f_healthy_More than 0.1992 | cost_healthy_Less than or equal to 3.31 | cost_healthy_More than 3.31 | cost_nutrient_Less than or equal to 2.45 | cost_nutrient_More than 2.45 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
5 rows × 210 columns
predictors.shape
(744, 210)
predictors.to_csv('predictors.csv', index = False)
# Training and test sets
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.4, random_state=15)
print('X_train shape is ' , X_train.shape)
print('X_test shape is ' , X_test.shape)
print('y_train shape is ' , y_train.shape)
print('y_test shape is ' , y_test.shape)
X_train shape is  (446, 210)
X_test shape is  (298, 210)
y_train shape is  (446,)
y_test shape is  (298,)
The train-test split phase is a crucial step in the modeling process, especially considering the relatively modest size of our dataset, comprising 744 observations. This phase involves dividing our data into subsets, namely the training set and the test set, with the aim of training the model on one subset and evaluating its performance on another.
In our implementation, we have utilized the train_test_split function with a test size of 0.4, which corresponds to 40% of the data being allocated to the test set. Additionally, we have set random seeds (random_state) to ensure reproducibility in subsequent analyses.
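As a side note, with a binary target train_test_split also accepts a stratify argument that keeps the class proportions identical in both subsets. A minimal sketch on synthetic data (all names and values here are illustrative, not from our dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, binary target with a 70/30 class balance
rng = np.random.default_rng(15)
X_demo = pd.DataFrame({'feature': rng.normal(size=100)})
y_demo = pd.Series([0] * 70 + [1] * 30)

# stratify=y_demo preserves the 70/30 ratio in train and test alike
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.4, random_state=15,
                                      stratify=y_demo)
print(ytr.mean(), yte.mean())  # both exactly 0.3
```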
# Training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.4, random_state=42)
print('X_train shape is ' , X_train.shape)
print('X_val shape is ' , X_val.shape)
print('y_train shape is ' , y_train.shape)
print('y_val shape is ' , y_val.shape)
X_train shape is  (267, 210)
X_val shape is  (179, 210)
y_train shape is  (267,)
y_val shape is  (179,)
# Scale the features to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
X_train_preprocess = scaler.transform(X_train)
X_train = pd.DataFrame(X_train_preprocess, columns = X_train.columns)
del X_train_preprocess
X_val_preprocess = scaler.transform(X_val)
X_val = pd.DataFrame(X_val_preprocess, columns = X_val.columns)
del X_val_preprocess
The feature scaling phase is a pivotal step in preparing our data for modeling. By bringing all features onto a common range, we prevent variables with larger magnitudes from dominating the model.
Here, we have employed the MinMaxScaler from scikit-learn, configuring it to scale features within the range of 0 to 1. The scaler is fitted exclusively on the training set (X_train) to avoid data leakage from the validation set. The trained scaler is then applied to both the training and validation sets.
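This fit-on-train / transform-everywhere pattern generalizes to any data seen later. A minimal sketch with toy matrices (values are illustrative) showing that the same fitted scaler should also be applied to any held-out rows before prediction:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training matrix: column ranges are [0, 10] and [10, 30]
X_train_demo = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_new_demo = np.array([[5.0, 25.0]])  # stands in for validation or test rows

# Fit on the training data only, then reuse the fitted scaler everywhere else
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train_demo)
print(scaler.transform(X_new_demo))  # scaled with the training ranges: 0.5 and 0.75
```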
Logistic regression is a statistical method for binary classification: a form of regression analysis well suited to predicting the probability of an event occurring. It is analogous to multiple linear regression, except that the outcome is binary.
We employ this model first because it is known for its interpretability and simplicity. The model's coefficients provide direct insight into the impact of each feature: exponentiating a coefficient yields its odds ratio, which facilitates clear interpretation.
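Concretely, the model expresses the log-odds of the positive class as a linear function of the features and maps them to a probability with the sigmoid. A small sketch with made-up numbers (b0, b1 and x are illustrative, not fitted values):

```python
import numpy as np

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative intercept and coefficient for a single binary feature x
b0, b1, x = 0.5, 2.0, 1.0
log_odds = b0 + b1 * x       # linear in the features
prob = sigmoid(log_odds)     # P(y = 1 | x)
odds_ratio = np.exp(b1)      # multiplicative change in odds per unit of x

print(round(float(prob), 3), round(float(odds_ratio), 3))  # 0.924 7.389
```

The exp(coefficient) quantity here is the same odds ratio reported in the coefficient table later in this section.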
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)
model_logreg = LogisticRegression(C=0.1,
penalty='l2',
verbose=0,
solver='liblinear',
max_iter=5)
model_logreg.fit(X_train, y_train_encoded)
# Predictions
y_pred_train = model_logreg.predict(X_train)
y_pred_test = model_logreg.predict(X_test)
from sklearn import metrics

print("Logistic Regression Model Evaluation:\n")
print("Accuracy rate on the training set:", metrics.accuracy_score(y_train_encoded, y_pred_train))
print("Accuracy rate on the test set:\n", metrics.accuracy_score(y_test_encoded, y_pred_test))
# Metrics
from sklearn.metrics import accuracy_score, roc_curve, auc

logreg_acc = accuracy_score(y_test_encoded, y_pred_test)
cm_logreg = confusion_matrix(y_test_encoded, y_pred_test)
tpr_logreg = cm_logreg[1][1] / (cm_logreg[1][0] + cm_logreg[1][1])  # sensitivity
TestError = np.mean(y_test_encoded != y_pred_test)
# ROC Curve and AUC
y_prob = model_logreg.predict_proba(X_test)[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_encoded, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.figure(figsize=(8, 8))
plt.plot(false_positive_rate,
true_positive_rate,
color='darkorange',
lw=2,
label='ROC curve (area = {:.2f})'.format(roc_auc))
print("\nThe area under the ROC curve is:", roc_auc)
print("The error on the test set is:\n", TestError)
print('\nAccuracy is:', logreg_acc)
print('Sensitivity (TPR) is:', tpr_logreg)
print('\nClassification report\n\n', classification_report(y_test_encoded, y_pred_test))
print('\nConfusion matrix\n\n', confusion_matrix(y_test_encoded, y_pred_test))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
Logistic Regression Model Evaluation:
Accuracy rate on the training set: 0.9008403361344538
Accuracy rate on the test set:
0.9161073825503355
The area under the ROC curve is: 0.9440841194968553
The error on the test set is:
0.08389261744966443
Accuracy is: 0.9161073825503355
Sensitivity (TPR) is: 0.7924528301886793
Classification report
precision recall f1-score support
0 0.90 0.98 0.94 192
1 0.97 0.79 0.87 106
accuracy 0.92 298
macro avg 0.93 0.89 0.90 298
weighted avg 0.92 0.92 0.91 298
Confusion matrix
[[189 3]
[ 22 84]]
from sklearn.model_selection import GridSearchCV, StratifiedKFold
param_grid = {
'C': [0.001, 0.01, 0.1, 1, 10, 100],
'penalty': ['l1', 'l2'],
'solver': ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'],
'max_iter': [100, 500, 1000]
}
grid_search = GridSearchCV(LogisticRegression(),
param_grid,
cv=StratifiedKFold(n_splits=3),
scoring='accuracy')
grid_search.fit(X_train, y_train_encoded)
best_params = grid_search.best_params_
best_params
{'C': 10, 'max_iter': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)
model_logreg2 = LogisticRegression(C=10,
penalty='l2',
verbose=5,
solver='lbfgs',
max_iter=100)
model_logreg2.fit(X_train, y_train_encoded)
# Predictions
y_pred_train = model_logreg2.predict(X_train)
y_pred_test = model_logreg2.predict(X_test)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers. [Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.2s remaining: 0.0s [Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.2s finished
print("Optimised Logistic Regression Model Evaluation:\n")
print("Accuracy rate on the training set:", metrics.accuracy_score(y_train_encoded, y_pred_train))
print("Accuracy rate on the test set:\n", metrics.accuracy_score(y_test_encoded, y_pred_test))
# Metrics
from sklearn.metrics import accuracy_score, roc_curve, auc

optimised_logreg_acc = accuracy_score(y_test_encoded, y_pred_test)
optimised_logreg_cm = confusion_matrix(y_test_encoded, y_pred_test)
optimised_logreg_tpr = optimised_logreg_cm[1][1] / (optimised_logreg_cm[1][0] + optimised_logreg_cm[1][1])
TestError = np.mean(y_test_encoded != y_pred_test)
# Compute the AUC of the optimised model before reporting it
y_prob = model_logreg2.predict_proba(X_test)[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_encoded, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
print("\nThe AUC is:", roc_auc)
print("The error on the test set is:\n", TestError)
print('\nAccuracy =', optimised_logreg_acc)
print('Sensitivity (TPR) =', optimised_logreg_tpr)
print('\nClassification report:\n\n', classification_report(y_test_encoded, y_pred_test))
print('\nConfusion matrix:\n\n', confusion_matrix(y_test_encoded, y_pred_test))
# ROC Curve and AUC
y_prob = model_logreg2.predict_proba(X_test)[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_encoded, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.figure(figsize=(8, 8))
plt.plot(false_positive_rate,
true_positive_rate,
color='darkorange',
lw=2,
label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
Optimised Logistic Regression Model Evaluation:
Accuracy rate on the training set: 0.9932773109243698
Accuracy rate on the test set:
0.9932885906040269
The AUC is: 1.0
The error on the test set is:
0.006711409395973154
Accuracy = 0.9932885906040269
Sensitivity (TPR) = 0.9245283018867925
Classification report:
precision recall f1-score support
0 0.99 1.00 0.99 192
1 1.00 0.98 0.99 106
accuracy 0.99 298
macro avg 0.99 0.99 0.99 298
weighted avg 0.99 0.99 0.99 298
Confusion matrix:
[[192 0]
[ 2 104]]
model_logreg2_coeffs = pd.DataFrame(np.concatenate([model_logreg2.intercept_.reshape(-1,1),
model_logreg2.coef_],axis=1),
index = ["Coefficient"],
columns = ["Intercept"]+list(X_train.columns)).T
lr_def2_coeffs_sorted = model_logreg2_coeffs.sort_values(by="Coefficient", ascending=False)
model_logreg2_coeffs.head(30)
| Coefficient | |
|---|---|
| Intercept | 0.854925 |
| country_Albania | 1.488345 |
| country_Algeria | -0.340967 |
| country_Angola | 0.129087 |
| country_Anguilla | -0.851832 |
| country_Antigua and Barbuda | -1.232430 |
| country_Argentina | -0.502015 |
| country_Armenia | 1.751001 |
| country_Aruba | -0.113628 |
| country_Australia | -0.161972 |
| country_Austria | -0.154566 |
| country_Azerbaijan | -0.800222 |
| country_Bahamas, The | -0.771635 |
| country_Bahrain | -0.981420 |
| country_Bangladesh | 2.435854 |
| country_Barbados | -1.114709 |
| country_Belarus | -0.072100 |
| country_Belgium | -0.282446 |
| country_Belize | 0.100078 |
| country_Benin | 0.044811 |
| country_Bermuda | -0.507128 |
| country_Bhutan | 3.144796 |
| country_Bolivia | -4.262147 |
| country_Bonaire | -1.155417 |
| country_Bosnia and Herzegovina | -0.305812 |
| country_Botswana | 4.756429 |
| country_Brazil | -0.224000 |
| country_British Virgin Islands | -1.238214 |
| country_Brunei Darussalam | -1.311671 |
| country_Bulgaria | -0.376321 |
lr_def2_coeffs_sorted.head(30)
| Coefficient | |
|---|---|
| country_Guyana | 6.100476 |
| country_WORLD | 5.467703 |
| country_Tajikistan | 4.945516 |
| country_Kyrgyz Republic | 4.804847 |
| country_Botswana | 4.756429 |
| country_Indonesia | 4.508082 |
| country_Cabo Verde | 4.349958 |
| country_Mongolia | 4.348021 |
| country_Senegal | 4.258609 |
| country_Egypt, Arab Rep. | 4.135586 |
| country_Zimbabwe | 3.814598 |
| country_South Asia | 3.448173 |
| country_Myanmar | 3.350139 |
| per_notaff_calories_More than 6.93% | 3.326263 |
| country_Jamaica | 3.237392 |
| country_Bhutan | 3.144796 |
| country_Djibouti | 3.137345 |
| country_Bangladesh | 2.435854 |
| country_Mali | 2.435854 |
| country_India | 2.188353 |
| country_Lao PDR | 2.168420 |
| country_Mauritania | 2.155055 |
| country_Burkina Faso | 2.130609 |
| country_Philippines | 2.110279 |
| country_Fiji | 2.103685 |
| country_Sri Lanka | 2.033906 |
| country_Malawi | 1.836686 |
| country_Pakistan | 1.821205 |
| country_Armenia | 1.751001 |
| country_Lower middle income | 1.572847 |
lr_def2_coeffs_sorted.tail(30)
| Coefficient | |
|---|---|
| country_Togo | -1.244420 |
| country_Sint Maarten (Dutch part) | -1.275576 |
| country_Comoros | -1.276345 |
| country_West Bank and Gaza | -1.287807 |
| country_North America | -1.290949 |
| country_Brunei Darussalam | -1.311671 |
| country_Latin America & Caribbean | -1.359850 |
| country_El Salvador | -1.418446 |
| country_Korea, Rep. | -1.468001 |
| country_Israel | -1.519191 |
| country_Panama | -1.521795 |
| country_Montserrat | -1.596527 |
| country_Dominican Republic | -1.618795 |
| country_Switzerland | -1.625667 |
| country_Italy | -1.728394 |
| country_Iraq | -1.790022 |
| country_Cambodia | -1.963791 |
| country_Jordan | -2.162929 |
| country_Trinidad and Tobago | -2.163695 |
| country_Colombia | -2.193491 |
| country_Nicaragua | -2.200763 |
| country_Hong Kong SAR, China | -2.327314 |
| country_Vietnam | -2.404453 |
| country_United Arab Emirates | -2.408125 |
| country_Russian Federation | -2.519483 |
| country_Türkiye | -2.701884 |
| country_Saudi Arabia | -2.867361 |
| country_Kuwait | -2.886992 |
| per_notaff_calories_Less than or equal to 6.93% | -3.324808 |
| country_Bolivia | -4.262147 |
coefficients = np.concatenate([model_logreg2.intercept_.reshape(-1, 1), model_logreg2.coef_], axis=1)
feature_names = ["Intercept"] + list(X_train.columns)
table = pd.DataFrame(coefficients, index=["Coefficient"], columns=feature_names).T
table['Odds Ratio'] = np.exp(model_logreg2_coeffs['Coefficient'])
table['Exponential Function'] = "exp(" + model_logreg2_coeffs.index + " * " + str(round(model_logreg2_coeffs['Coefficient'].iloc[0], 4)) + ")"
# Logistic transform of each coefficient: 1 / (1 + exp(-beta))
table['Transformed Probability'] = 1 / (1 + np.exp(-(model_logreg2_coeffs['Coefficient'])))
table.head(20)
| Coefficient | Odds Ratio | Exponential Function | Transformed Probability | |
|---|---|---|---|---|
| Intercept | 0.854925 | 2.351198 | exp(Intercept * 0.8549) | 0.701599 |
| country_Albania | 1.488345 | 4.429756 | exp(country_Albania * 0.8549) | 0.815830 |
| country_Algeria | -0.340967 | 0.711082 | exp(country_Algeria * 0.8549) | 0.415575 |
| country_Angola | 0.129087 | 1.137789 | exp(country_Angola * 0.8549) | 0.532227 |
| country_Anguilla | -0.851832 | 0.426633 | exp(country_Anguilla * 0.8549) | 0.299049 |
| country_Antigua and Barbuda | -1.232430 | 0.291583 | exp(country_Antigua and Barbuda * 0.8549) | 0.225756 |
| country_Argentina | -0.502015 | 0.605310 | exp(country_Argentina * 0.8549) | 0.377067 |
| country_Armenia | 1.751001 | 5.760365 | exp(country_Armenia * 0.8549) | 0.852079 |
| country_Aruba | -0.113628 | 0.892590 | exp(country_Aruba * 0.8549) | 0.471624 |
| country_Australia | -0.161972 | 0.850465 | exp(country_Australia * 0.8549) | 0.459595 |
| country_Austria | -0.154566 | 0.856787 | exp(country_Austria * 0.8549) | 0.461435 |
| country_Azerbaijan | -0.800222 | 0.449229 | exp(country_Azerbaijan * 0.8549) | 0.309978 |
| country_Bahamas, The | -0.771635 | 0.462257 | exp(country_Bahamas, The * 0.8549) | 0.316126 |
| country_Bahrain | -0.981420 | 0.374778 | exp(country_Bahrain * 0.8549) | 0.272610 |
| country_Bangladesh | 2.435854 | 11.425570 | exp(country_Bangladesh * 0.8549) | 0.919521 |
| country_Barbados | -1.114709 | 0.328011 | exp(country_Barbados * 0.8549) | 0.246994 |
| country_Belarus | -0.072100 | 0.930438 | exp(country_Belarus * 0.8549) | 0.481983 |
| country_Belgium | -0.282446 | 0.753937 | exp(country_Belgium * 0.8549) | 0.429854 |
| country_Belize | 0.100078 | 1.105257 | exp(country_Belize * 0.8549) | 0.524999 |
| country_Benin | 0.044811 | 1.045830 | exp(country_Benin * 0.8549) | 0.511201 |
table.tail(20)
| Coefficient | Odds Ratio | Exponential Function | Transformed Probability | |
|---|---|---|---|---|
| pop_cat_Less than or equal to 10057698 | -0.207547 | 0.812575 | exp(pop_cat_Less than or equal to 10057698 * 0... | 0.448299 |
| pop_cat_More than 10057698 | 0.209001 | 1.232447 | exp(pop_cat_More than 10057698 * 0.8549) | 0.552061 |
| per_notaff_calories_Less than or equal to 6.93% | -3.324808 | 0.035979 | exp(per_notaff_calories_Less than or equal to ... | 0.034730 |
| per_notaff_calories_More than 6.93% | 3.326263 | 27.834137 | exp(per_notaff_calories_More than 6.93% * 0.8549) | 0.965319 |
| cost_of_healthy_Less than or equal to 0.0396 | -0.559254 | 0.571635 | exp(cost_of_healthy_Less than or equal to 0.03... | 0.363720 |
| cost_of_healthy_More than 0.0396 | 0.560709 | 1.751914 | exp(cost_of_healthy_More than 0.0396 * 0.8549) | 0.636617 |
| cost_lns_healthy_Less than or equal to 0.1046 | 1.070683 | 2.917371 | exp(cost_lns_healthy_Less than or equal to 0.1... | 0.744727 |
| cost_lns_healthy_More than 0.1046 | -1.069228 | 0.343273 | exp(cost_lns_healthy_More than 0.1046 * 0.8549) | 0.255550 |
| cost_asf_healthy_Less than or equal to 0.2658 | -0.742855 | 0.475754 | exp(cost_asf_healthy_Less than or equal to 0.2... | 0.322380 |
| cost_asf_healthy_More than 0.2658 | 0.744309 | 2.104987 | exp(cost_asf_healthy_More than 0.2658 * 0.8549) | 0.677937 |
| cost_ss_healthy_Less than or equal to 0.1521 | 0.435144 | 1.545186 | exp(cost_ss_healthy_Less than or equal to 0.15... | 0.607101 |
| cost_ss_healthy_More than 0.1521 | -0.433689 | 0.648114 | exp(cost_ss_healthy_More than 0.1521 * 0.8549) | 0.393246 |
| cost_v_healthy_Less than or equal to 0.2373 | 0.039356 | 1.040141 | exp(cost_v_healthy_Less than or equal to 0.237... | 0.509838 |
| cost_v_healthy_More than 0.2373 | -0.037901 | 0.962808 | exp(cost_v_healthy_More than 0.2373 * 0.8549) | 0.490526 |
| cost_f_healthy_Less than or equal to 0.1992 | 0.128137 | 1.136708 | exp(cost_f_healthy_Less than or equal to 0.199... | 0.531990 |
| cost_f_healthy_More than 0.1992 | -0.126682 | 0.881014 | exp(cost_f_healthy_More than 0.1992 * 0.8549) | 0.468372 |
| cost_healthy_Less than or equal to 3.31 | -0.247751 | 0.780554 | exp(cost_healthy_Less than or equal to 3.31 * ... | 0.438377 |
| cost_healthy_More than 3.31 | 0.249206 | 1.283006 | exp(cost_healthy_More than 3.31 * 0.8549) | 0.561981 |
| cost_nutrient_Less than or equal to 2.45 | -0.154452 | 0.856884 | exp(cost_nutrient_Less than or equal to 2.45 *... | 0.461464 |
| cost_nutrient_More than 2.45 | 0.155907 | 1.168718 | exp(cost_nutrient_More than 2.45 * 0.8549) | 0.538898 |
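As a sanity check on the table above, the "Odds Ratio" and "Transformed Probability" columns follow directly from each coefficient via the logistic transform: odds ratio = exp(coefficient), probability = odds / (1 + odds). A minimal sketch (not the notebook's exact code) using the first `pop_cat` row:

```python
import numpy as np

coef = -0.207547  # coefficient for pop_cat_Less than or equal to 10057698
odds_ratio = np.exp(coef)                      # "Odds Ratio" column
probability = odds_ratio / (1 + odds_ratio)    # "Transformed Probability" column
print(round(odds_ratio, 6), round(probability, 6))  # → 0.812575 0.448299
```

Both values match the corresponding table row, confirming how the columns were derived.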
The optimised model exhibits exceptional performance on both the training and test sets. The accuracy rate on the training set is notably high at 99.33%, demonstrating the model's ability to accurately classify instances within the training dataset. This high accuracy extends to the test set, where the model achieves a similar accuracy rate of 99.33%, indicating a theoretically robust generalization to new, unseen data.
The Area Under the Curve (AUC) is a perfect 1.0, emphasizing the model's flawless discrimination between positive and negative classes. The low error rate on the test set (0.67%) further attests to the model's exceptional performance in minimizing misclassifications.
The precision and recall metrics indicate outstanding performance, with precision values of 99% and 100% for class 0 ('Less than or equal to 37.42%') and class 1 ('More than 37.42%'), respectively. Recall values, or sensitivity (TPR), are 100% for class 0 and 92.45% for class 1, demonstrating the model's ability to capture a high percentage of actual positive instances.
The classification report provides a comprehensive overview of the model's performance, showcasing high F1-scores of 99% for both classes. The macro average and weighted average metrics further highlight the balanced performance across both classes.
The confusion matrix reinforces the model's excellent performance, revealing that the model correctly predicts all instances of class 0 and accurately identifies 104 instances of class 1, with only 2 false positive predictions. Overall, the optimised logistic regression model demonstrates exceptional predictive accuracy and robustness in handling imbalanced classes.
However, although the outcomes of this optimised model appear exceptional, there is a concern that they might be suggestive of overfitting, which is an undesirable scenario. Consequently, we will proceed to implement alternative models and evaluate them, to ensure robust and generalizable performance.
A decision tree is a decision support tool that represents a set of choices in the graphic form of a tree. The possible outcomes sit at the ends of the branches (the "leaves" of the tree) and are reached through the decisions made at each stage. Decision trees are used in various fields such as security, data mining, and medicine. In our case, we will use one as a prediction model (supervised learning).
Illustrative diagram:
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.metrics import classification_report, confusion_matrix
model_dtree = DecisionTreeClassifier(max_depth=3,
min_samples_split=5,
min_samples_leaf=2) # => AUC : 0.8816
model_dtree.fit(X_train, y_train)
y_pred = model_dtree.predict(X_test)
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import f1_score, classification_report, confusion_matrix, roc_auc_score, roc_curve, auc
y_prob = model_dtree.predict_proba(X_test)[:, 1]
y_pred_labels = np.where(y_prob > 0.5, 'More than 37.42%', 'Less than or equal to 37.42%')  # predict_proba column 1 corresponds to the alphabetically later class, 'More than 37.42%'
print("Decision Trees Model Evaluation:\n")
print("Classification Report:\n")
print(classification_report(y_test, y_pred_labels))
print("Confusion Matrix:\n")
print(confusion_matrix(y_test, y_pred_labels))
accuracy = accuracy_score(y_test, y_pred_labels)
precision = precision_score(y_test, y_pred_labels, pos_label='More than 37.42%', average='binary')
recall = recall_score(y_test, y_pred_labels, pos_label='More than 37.42%', average='binary')
f1 = f1_score(y_test, y_pred_labels, pos_label='More than 37.42%', average='binary')
print(f"\nAccuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
# ROC Curve and AUC
y_prob = model_dtree.predict_proba(X_test)[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_encoded, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.figure(figsize=(8, 8))
plt.plot(false_positive_rate,
true_positive_rate,
color='darkorange',
lw=2,
label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
Decision Trees Model Evaluation:
Classification Report:
precision recall f1-score support
Less than or equal to 37.42% 0.17 0.08 0.11 192
More than 37.42% 0.15 0.30 0.20 106
accuracy 0.16 298
macro avg 0.16 0.19 0.15 298
weighted avg 0.16 0.16 0.14 298
Confusion Matrix:
[[ 15 177]
[ 74 32]]
Accuracy: 0.1577
Precision: 0.1531
Recall: 0.3019
F1 Score: 0.2032
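Accuracy far below chance, as above, is typically a symptom of mapping the `predict_proba` columns to the wrong labels. scikit-learn orders those columns by `model.classes_` (alphabetical for string labels), which a toy example on hypothetical data makes easy to verify:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with the same two string labels as our target
X = [[0], [1], [2], [3]]
y = ['Less than or equal to 37.42%', 'Less than or equal to 37.42%',
     'More than 37.42%', 'More than 37.42%']
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

print(clf.classes_)  # alphabetical: index 0 = 'Less...', index 1 = 'More...'
# Column 1 of predict_proba is therefore the probability of 'More than 37.42%'
probs = clf.predict_proba(X)[:, 1]
labels = np.where(probs > 0.5, clf.classes_[1], clf.classes_[0])
print(labels)
```

Swapping the two branches of `np.where` inverts every downstream metric, which is exactly what the report above shows.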
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, auc
from sklearn.tree import plot_tree
param_grid = {
'max_depth': range(1, 21),
'max_leaf_nodes': range(10, 200, 10),
'min_samples_split': range(5, 10)
}
model_dtree2 = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(model_dtree2,
param_grid,
cv=5,
scoring='accuracy',
verbose=5,
n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Best hyperparameters: ", grid_search.best_params_)
best_dtree = grid_search.best_estimator_
best_dtree.fit(X_train, y_train)
y_pred = best_dtree.predict(X_test)
Fitting 5 folds for each of 1900 candidates, totalling 9500 fits
Best hyperparameters: {'max_depth': 11, 'max_leaf_nodes': 30, 'min_samples_split': 5}
# Hyperparameter tuning for max_depth
train_acc1 = []
val_acc1 = []
for max_d in range(1,21):
model = DecisionTreeClassifier(max_depth=max_d, random_state=42)
model.fit(X_train, y_train)
train_acc1.append(model.score(X_train, y_train))
val_acc1.append(model.score(X_test,y_test))
line1, = plt.plot([*range(1,21)], train_acc1, 'b', label='Train accuracy')
line2, = plt.plot([*range(1,21)], val_acc1, 'r', label='Validation accuracy')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.title('Train vs Validation Accuracy : Max_depth')
plt.ylabel('Accuracy')
plt.xlabel('Max_depth')
plt.show()
train_acc1.clear()
val_acc1.clear()
# Hyperparameter tuning for max_leaf_nodes
train_acc2 = []
val_acc2 = []
for max_ln in range(10,200,10):
model2 = DecisionTreeClassifier(max_leaf_nodes=max_ln, random_state=42)
model2.fit(X_train, y_train)
train_acc2.append(model2.score(X_train, y_train))
val_acc2.append(model2.score(X_test,y_test))
line3, = plt.plot([*range(10,200,10)], train_acc2, 'b', label='Train accuracy')
line4, = plt.plot([*range(10,200,10)], val_acc2, 'r', label='Validation accuracy')
plt.legend(handler_map={line3: HandlerLine2D(numpoints=2)})
plt.title('Train vs Validation Accuracy : Max_leaf_nodes')
plt.ylabel('Accuracy')
plt.xlabel('Max_leaf_nodes')
plt.show()
train_acc2.clear()
val_acc2.clear()
# Hyperparameter tuning for min_samples_split
min_samples_split_values = [*range(2, 21)] # Adjust the range as needed
train_acc_split = []
val_acc_split = []
for min_split in min_samples_split_values:
model_split = DecisionTreeClassifier(min_samples_split=min_split, random_state=42)
model_split.fit(X_train, y_train)
train_acc_split.append(model_split.score(X_train, y_train))
val_acc_split.append(model_split.score(X_test, y_test))
# Plot the results
line5, = plt.plot(min_samples_split_values, train_acc_split, 'b', label='Train accuracy')
line6, = plt.plot(min_samples_split_values, val_acc_split, 'r', label='Validation accuracy')
plt.legend(handler_map={line5: HandlerLine2D(numpoints=2)})
plt.title('Train vs Validation Accuracy : Min_samples_split')
plt.ylabel('Accuracy')
plt.xlabel('Min_samples_split')
plt.show()
# Clear lists for reuse
train_acc_split.clear()
val_acc_split.clear()
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
# Grid search above suggested max_depth=11, max_leaf_nodes=30, min_samples_split=5;
# the validation curves favour a shallower tree, so max_depth is set to 5
best_hyperparameters = {'max_depth': 5,
'max_leaf_nodes': 30,
'min_samples_split' : 5}
optimised_dtree = DecisionTreeClassifier(
max_depth=best_hyperparameters['max_depth'],
max_leaf_nodes=best_hyperparameters['max_leaf_nodes'],
min_samples_split=best_hyperparameters['min_samples_split'],
min_samples_leaf=2
)
optimised_dtree.fit(X_train, y_train)
y_pred_optimised = optimised_dtree.predict(X_test)
plt.figure(figsize=(16, 8))
plot_tree(optimised_dtree,
feature_names=X_train.columns,
class_names=True,
filled=True,
rounded=True,
fontsize=7,
impurity=False,
precision=5,
proportion=True)
plt.rcParams['lines.markersize'] = 8
plt.tight_layout()
plt.savefig("decision_tree.png")
plt.show()
y_pred_optimized = optimised_dtree.predict(X_test)
y_prob_optimized = optimised_dtree.predict_proba(X_test)[:, 1]
pos_label_optimized = label_encoder.transform(['More than 37.42%'])[0]
false_positive_rate_optimized, true_positive_rate_optimized, thresholds_optimized = roc_curve(y_test_encoded,
y_prob_optimized,
pos_label=pos_label_optimized)
roc_auc_optimized = auc(false_positive_rate_optimized, true_positive_rate_optimized)
print("\nOptimized Decision Trees Model Evaluation:\n")
print("Classification Report:")
print(classification_report(y_test, y_pred_optimized))
print("\nConfusion Matrix:\n")
print(confusion_matrix(y_test, y_pred_optimized))
optimised_dtree_acc = accuracy_score(y_test, y_pred_optimized)
# optimised_dtree_cm = confusion_matrix(y_test_encoded, y_pred_test)
optimised_dtree_precision = precision_score(y_test, y_pred_optimized, pos_label='More than 37.42%', average='binary')
optimised_dtree_recall = recall_score(y_test, y_pred_optimized, pos_label='More than 37.42%', average='binary')
optimised_dtree_f1 = f1_score(y_test, y_pred_optimized, pos_label='More than 37.42%', average='binary')
# optimised_dtree_tpr = optimised_dtree_cm[1][1] / (optimised_dtree_cm[1][0] + optimised_dtree_cm[1][1])
print(f"\nAccuracy: {optimised_dtree_acc :.4f}")
print(f"Precision: {optimised_dtree_precision:.4f}")
print(f"Recall: {optimised_dtree_recall:.4f}")
print(f"F1 Score: {optimised_dtree_f1:.4f}")
# print(f"Sensitivity (TPR): {optimised_dtree_tpr:.4f}")
plt.figure(figsize=(10, 6))
plt.plot(false_positive_rate_optimized, true_positive_rate_optimized, color='darkorange', lw=2,
label='ROC curve (AUC = {:.2f})'.format(roc_auc_optimized))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
print(f"AUC: {roc_auc_optimized:.4f}")
Optimized Decision Trees Model Evaluation:
Classification Report:
precision recall f1-score support
Less than or equal to 37.42% 0.87 0.96 0.91 192
More than 37.42% 0.91 0.74 0.81 106
accuracy 0.88 298
macro avg 0.89 0.85 0.86 298
weighted avg 0.88 0.88 0.88 298
Confusion Matrix:
[[184 8]
[ 28 78]]
Accuracy: 0.8792
Precision: 0.9070
Recall: 0.7358
F1 Score: 0.8125
AUC: 0.8986
The optimized decision tree model exhibited satisfactory performance, as indicated by the evaluation metrics. In terms of precision, the model achieved 87% for the class "Less than or equal to 37.42%" and 91% for "More than 37.42%". The recall values were 96% and 74% for these respective classes: roughly 24 out of 25 countries with healthy-diet unaffordability at or below 37.42% were correctly identified, and about 3 out of 4 countries above 37.42% were correctly identified. The overall accuracy of the model was 88%.
The confusion matrix revealed that the model correctly classified 184 instances of "Less than or equal to 37.42%" and 78 instances of "More than 37.42%", with 8 and 28 misclassifications, respectively.
The area under the ROC curve (AUC) further supported the model's discriminative ability, yielding a value of 0.8986. This suggests that the model has a good balance between true positive rate and false positive rate, indicating effective discrimination between the two classes.
This model presents a distinct trade-off compared to the previously optimised logistic regression model. The logistic regression achieved higher precision (100% vs. 90.7% for the decision tree) and overall accuracy (99.33% vs. 87.92%), but its near-perfect scores raised the overfitting concern noted above, whereas the decision tree's performance looks more plausible. Given that our goal is to avoid false negatives, the decision tree's recall of only 73.58% for the "More than 37.42%" class means it misses many of the cases we most care about. We are therefore going to implement and evaluate more models.
The random forest is an essential algorithm in machine learning. It is based on an ensemble of decision trees. It is quite intuitive to understand, quick to train, and it produces generalizable results. The main downside is that a random forest is a black box: its results are difficult to interpret, that is to say, not very explanatory.
Illustrative diagram:
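The aggregation step can be sketched on hypothetical data: scikit-learn's random forest averages the per-tree class probabilities ("soft voting") and picks the class with the highest mean probability, which we can reproduce by hand from the fitted trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy dataset, only for illustrating the voting mechanism
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class-probability estimates of the individual trees ("soft vote")
avg_probs = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)
manual_pred = forest.classes_[avg_probs.argmax(axis=1)]

# The hand-rolled vote matches the forest's own predictions
print((manual_pred == forest.predict(X)).all())
```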
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
model_rf = RandomForestClassifier(
n_estimators=3, # Number of trees in the forest
max_depth=None, # Maximum depth of the tree
min_samples_split=5, # Minimum number of samples required to split an internal node
min_samples_leaf=1, # Minimum number of samples required to be at a leaf node
max_features='auto', # Number of features to consider when looking for the best split
bootstrap=True, # Whether bootstrap samples are used when building trees
random_state=0, # Seed for random number generation
n_jobs=-1 # Number of jobs to run in parallel (set to -1 to use all available processors)
)
model_rf.fit(X_train, y_train)
y_pred_train = model_rf.predict(X_train)
y_pred_val = model_rf.predict(X_val)
from sklearn.preprocessing import LabelEncoder
print("Random Forest Model Evaluation:")
print("Classification Report:")
print(classification_report(y_val, y_pred_val))
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_pred_val))
# Additional metrics for binary classification
accuracy_forest = accuracy_score(y_val, y_pred_val)
precision_forest = precision_score(y_val, y_pred_val, pos_label='More than 37.42%', average='binary')
recall_forest = recall_score(y_val, y_pred_val, pos_label='More than 37.42%', average='binary')
f1_forest = f1_score(y_val, y_pred_val, pos_label='More than 37.42%', average='binary')
print(f"Accuracy: {accuracy_forest:.4f}")
print(f"Precision: {precision_forest:.4f}")
print(f"Recall: {recall_forest:.4f}")
print(f"F1 Score: {f1_forest:.4f}")
# ROC Curve and AUC for binary classification
y_prob_forest = model_rf.predict_proba(X_val)[:, 1]
# Specify pos_label based on your positive class
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
positive_class = 'More than 37.42%'
pos_label_forest = label_encoder.transform([positive_class])[0]
# Assuming you have a LabelEncoder instance named label_encoder
y_val_encoded = label_encoder.transform(y_val)
false_positive_rate_forest, true_positive_rate_forest, thresholds_forest = roc_curve(y_val_encoded,
y_prob_forest,
pos_label=pos_label_forest)
roc_auc_forest = auc(false_positive_rate_forest, true_positive_rate_forest)
plt.figure(figsize=(10, 6))
plt.plot(false_positive_rate_forest, true_positive_rate_forest, color='darkorange', lw=2,
label='ROC curve (AUC = {:.2f})'.format(roc_auc_forest))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
print(f"AUC: {roc_auc_forest:.4f}")
Random Forest Model Evaluation:
Classification Report:
precision recall f1-score support
Less than or equal to 37.42% 0.87 0.94 0.90 115
More than 37.42% 0.87 0.75 0.81 64
accuracy 0.87 179
macro avg 0.87 0.84 0.86 179
weighted avg 0.87 0.87 0.87 179
Confusion Matrix:
[[108 7]
[ 16 48]]
Accuracy: 0.8715
Precision: 0.8727
Recall: 0.7500
F1 Score: 0.8067
AUC: 0.8933
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
param_grid = {
'n_estimators': [50, 100, 150],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['auto', 'sqrt', 'log2'],
'bootstrap': [True, False]
}
model_rf2 = RandomForestClassifier(random_state=1, n_jobs=-1)
grid_search = GridSearchCV(model_rf2, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
best_forest = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)
print("Best Forest:", grid_search.best_estimator_)
Best Parameters: {'bootstrap': False, 'max_depth': 20, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Best Forest: RandomForestClassifier(bootstrap=False, max_depth=20, max_features='auto',
n_estimators=50, n_jobs=-1, random_state=1)
# Optimised Random Forest model with suggested parameters
optimised_rf = RandomForestClassifier(
n_estimators=50,
max_depth=20,
min_samples_split=2,
min_samples_leaf=1,
max_features='auto',
bootstrap=False,
random_state=1,
n_jobs=-1
)
optimised_rf.fit(X_train, y_train)
y_pred_train_optimised = optimised_rf.predict(X_train)
y_pred_val_optimised = optimised_rf.predict(X_val)
print("Optimized Random Forest Model Evaluation:\n")
print("Classification Report:\n")
print(classification_report(y_val, y_pred_val_optimised))
print("Confusion Matrix:\n")
print(confusion_matrix(y_val, y_pred_val_optimised))
optimised_rf_acc = accuracy_score(y_val, y_pred_val_optimised)
optimised_rf_cm = confusion_matrix(y_val, y_pred_val_optimised)  # validation-set confusion matrix, matching the predictions evaluated above
optimised_rf_precision = precision_score(y_val, y_pred_val_optimised, pos_label='More than 37.42%', average='binary')
optimised_rf_recall = recall_score(y_val, y_pred_val_optimised, pos_label='More than 37.42%', average='binary')
optimised_rf_f1 = f1_score(y_val, y_pred_val_optimised, pos_label='More than 37.42%', average='binary')
optimised_rf_tpr = optimised_rf_cm[1][1] / (optimised_rf_cm[1][0] + optimised_rf_cm[1][1])
print(f"\nAccuracy: {optimised_rf_acc:.4f}")
print(f"Precision: {optimised_rf_precision:.4f}")
print(f"Recall: {optimised_rf_recall:.4f}")
print(f"F1 Score: {optimised_rf_f1:.4f}")
print(f"Sensitivity (TPR): {optimised_rf_tpr:.4f}")
y_prob_forest_optimised = optimised_rf.predict_proba(X_val)[:, 1]
pos_label_forest_optimised = label_encoder.transform(['More than 37.42%'])[0]
y_val_encoded_optimised = label_encoder.transform(y_val)
false_positive_rate_forest_optimised, true_positive_rate_forest_optimised, thresholds_forest_optimised = roc_curve(
y_val_encoded_optimised, y_prob_forest_optimised, pos_label=pos_label_forest_optimised
)
roc_auc_forest_optimised = auc(false_positive_rate_forest_optimised, true_positive_rate_forest_optimised)
plt.figure(figsize=(10, 6))
plt.plot(
false_positive_rate_forest_optimised,
true_positive_rate_forest_optimised,
color='darkorange',
lw=2,
label='ROC curve (AUC = {:.2f})'.format(roc_auc_forest_optimised),
)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
print(f"AUC: {roc_auc_forest_optimised:.4f}")
Optimized Random Forest Model Evaluation:
Classification Report:
precision recall f1-score support
Less than or equal to 37.42% 0.96 0.96 0.96 115
More than 37.42% 0.92 0.94 0.93 64
accuracy 0.95 179
macro avg 0.94 0.95 0.95 179
weighted avg 0.95 0.95 0.95 179
Confusion Matrix:
[[110 5]
[ 4 60]]
Accuracy: 0.9497
Precision: 0.9231
Recall: 0.9375
F1 Score: 0.9302
Sensitivity (TPR): 0.9245
AUC: 0.9857
# Feature importance
df_importance = pd.DataFrame({'Variable':pd.DataFrame(X_test).columns,
'Importance':optimised_rf.feature_importances_}).sort_values('Importance', ascending=False)
df_importance.head(20)
| Variable | Importance | |
|---|---|---|
| 192 | per_notaff_calories_Less than or equal to 6.93% | 0.163708 |
| 193 | per_notaff_calories_More than 6.93% | 0.126980 |
| 198 | cost_asf_healthy_Less than or equal to 0.2658 | 0.042191 |
| 199 | cost_asf_healthy_More than 0.2658 | 0.033745 |
| 197 | cost_lns_healthy_More than 0.1046 | 0.032030 |
| 208 | cost_nutrient_Less than or equal to 2.45 | 0.024265 |
| 196 | cost_lns_healthy_Less than or equal to 0.1046 | 0.023746 |
| 206 | cost_healthy_Less than or equal to 3.31 | 0.022550 |
| 209 | cost_nutrient_More than 2.45 | 0.021962 |
| 195 | cost_of_healthy_More than 0.0396 | 0.021277 |
| 194 | cost_of_healthy_Less than or equal to 0.0396 | 0.021249 |
| 204 | cost_f_healthy_Less than or equal to 0.1992 | 0.019434 |
| 186 | classif_FPN 1.0 | 0.017504 |
| 191 | pop_cat_More than 10057698 | 0.017034 |
| 205 | cost_f_healthy_More than 0.1992 | 0.016698 |
| 207 | cost_healthy_More than 3.31 | 0.016059 |
| 201 | cost_ss_healthy_More than 0.1521 | 0.015674 |
| 155 | country_South Asia | 0.014603 |
| 182 | country_WORLD | 0.014173 |
| 95 | country_Kyrgyz Republic | 0.013657 |
df_importance.tail(20)
| Variable | Importance | |
|---|---|---|
| 88 | country_Jamaica | 0.0 |
| 8 | country_Australia | 0.0 |
| 7 | country_Aruba | 0.0 |
| 75 | country_Haiti | 0.0 |
| 74 | country_Guyana | 0.0 |
| 165 | country_Switzerland | 0.0 |
| 85 | country_Ireland | 0.0 |
| 171 | country_Trinidad and Tobago | 0.0 |
| 158 | country_St. Kitts and Nevis | 0.0 |
| 147 | country_Serbia | 0.0 |
| 129 | country_North America | 0.0 |
| 104 | country_Luxembourg | 0.0 |
| 42 | country_Congo, Dem. Rep. | 0.0 |
| 108 | country_Maldives | 0.0 |
| 173 | country_Turks and Caicos Islands | 0.0 |
| 141 | country_Romania | 0.0 |
| 114 | country_Middle East & North Africa | 0.0 |
| 184 | country_Zambia | 0.0 |
| 124 | country_Netherlands | 0.0 |
| 160 | country_St. Vincent and the Grenadines | 0.0 |
feature_imp = pd.DataFrame(sorted(zip(optimised_rf.feature_importances_,
pd.DataFrame(X_test).columns),
reverse=True),
columns=['Value','Feature'])
top_n = 20
plt.figure(1,figsize=(20, 8))
plt.subplot(1, 2, 2)
sns.barplot(x="Value", y="Feature", data=feature_imp.head(top_n))
plt.title('Top {} Variables by Importance'.format(top_n), fontsize=16)
plt.xlabel('Importance', fontsize=14)
plt.ylabel('Feature', fontsize=14)
plt.show()
The results of the optimized random forest model demonstrate strong performance in classifying countries by the affordability of a healthy diet. The model achieves an impressive accuracy of 94.97%, with high precision (92.31%) and recall (93.75%) for the class "More than 37.42%". This indicates that the model effectively identifies countries with higher unaffordability while keeping false positives low.
The classification report further supports the model's robustness, with high scores for precision, recall, and F1-score for both classes. The confusion matrix reveals that the model misclassifies only a small number of instances, with five false positives and four false negatives. This suggests a well-balanced trade-off between precision and recall.
The optimized random forest model outperforms the optimized decision tree and, unlike the near-perfect logistic regression, shows little sign of overfitting, achieving 94.97% accuracy on the validation dataset. It strikes an equilibrium between precision and recall for both classes, minimizing false positives and negatives. In contrast, the logistic regression model, while accurate, exhibits slightly lower recall for the positive class, and the decision tree model, although balanced, lags in overall accuracy.
Support Vector Machines (SVM) is a supervised machine learning algorithm used for classification and regression tasks. SVM works by finding the optimal hyperplane that separates data into different classes while maximizing the margin between them. It's effective for both linear and non-linear relationships, thanks to kernel functions. SVM is widely used in various domains due to its versatility and ability to handle complex datasets.
Illustrative diagram:
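Before fitting our model, the margin idea can be illustrated on a hypothetical, linearly separable toy set: `decision_function` returns the signed score w·x + b, whose sign determines which side of the separating hyperplane a point falls on.

```python
import numpy as np
from sklearn import svm

# Hypothetical, linearly separable toy data (for illustration only)
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = [0, 0, 1, 1]
clf = svm.SVC(kernel='linear', C=1.0).fit(X, y)

# Signed score w·x + b: negative side -> class 0, positive side -> class 1
scores = clf.decision_function(X)
print(np.sign(scores))
print(clf.support_vectors_)  # the points that lie on the margin boundaries
```

Only the support vectors determine the hyperplane; moving any other point (without crossing the margin) leaves the fitted boundary unchanged.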
model_svm = svm.SVC(
C=1.0,
kernel='rbf',
degree=3,
gamma='scale',
coef0=0.0,
shrinking=True,
probability=False,
tol=1e-3,
cache_size=200,
class_weight=None,
verbose=False,
max_iter=-1,
decision_function_shape='ovr'
)
model_svm.fit(X_train, y_train)
y_pred_train = model_svm.predict(X_train)
y_pred_val = model_svm.predict(X_val)
print("SVM Model Evaluation:\n")
print("Classification Report:")
print(classification_report(y_val, y_pred_val))
print("Confusion Matrix:\n")
print(confusion_matrix(y_val, y_pred_val))
accuracy_svm = accuracy_score(y_val, y_pred_val)
precision_svm = precision_score(y_val, y_pred_val, pos_label='More than 37.42%', average='binary')
recall_svm = recall_score(y_val, y_pred_val, pos_label='More than 37.42%', average='binary')
f1_svm = f1_score(y_val, y_pred_val, pos_label='More than 37.42%', average='binary')
print(f"\nAccuracy: {accuracy_svm:.4f}")
print(f"Precision: {precision_svm:.4f}")
print(f"Recall: {recall_svm:.4f}")
print(f"F1 Score: {f1_svm:.4f}")
y_prob_svm = model_svm.decision_function(X_val)
pos_label_svm = label_encoder.transform(['More than 37.42%'])[0]
y_val_encoded = label_encoder.transform(y_val)
false_positive_rate_svm, true_positive_rate_svm, thresholds_svm = roc_curve(y_val_encoded,
y_prob_svm,
pos_label=pos_label_svm)
roc_auc_svm = auc(false_positive_rate_svm, true_positive_rate_svm)
plt.figure(figsize=(10, 6))
plt.plot(false_positive_rate_svm, true_positive_rate_svm, color='darkorange', lw=2,
label='ROC curve (AUC = {:.2f})'.format(roc_auc_svm))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
print(f"AUC: {roc_auc_svm:.4f}")
SVM Model Evaluation:
Classification Report:
precision recall f1-score support
Less than or equal to 37.42% 0.88 0.99 0.93 115
More than 37.42% 0.98 0.75 0.85 64
accuracy 0.91 179
macro avg 0.93 0.87 0.89 179
weighted avg 0.91 0.91 0.90 179
Confusion Matrix:
[[114 1]
[ 16 48]]
Accuracy: 0.9050
Precision: 0.9796
Recall: 0.7500
F1 Score: 0.8496
AUC: 0.9647
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
param_grid_svm = {
'C': [0.1, 1, 10], # Regularization parameter that controls the trade-off between smooth decision boundary and classifying training points correctly
'kernel': ['linear', 'rbf'], # Specifies the type of kernel used in the algorithm
'degree': [2, 3], # Degree of the polynomial kernel function ('poly')
'gamma': ['scale', 'auto'], # Kernel coefficient for 'rbf', 'poly', and 'sigmoid'. 'scale' uses 1 / (n_features * X.var()), 'auto' uses 1 / n_features
'coef0': [0.0, 1.0], # Independent term in kernel function
'shrinking': [True, False], # Whether to use the shrinking heuristic. Helps speed up training
'probability': [True, False], # Whether to enable probability estimates. This is required for ROC-AUC scores
'tol': [1e-3, 1e-4], # Tolerance for stopping criterion. Controls the convergence of the optimization process
'class_weight': [None, 'balanced'], # Weights associated with classes. 'balanced' adjusts weights inversely proportional to class frequencies
'verbose': [True, False], # Controls the verbosity of the output during training (does not change the fitted model, so searching over it only duplicates fits)
'max_iter': [100, 500], # Hard limit on iterations within solver. Helps prevent infinite loops
'decision_function_shape': ['ovo', 'ovr'], # Multi-class strategy when there are more than two classes: 'ovo' (one-vs-one) or 'ovr' (one-vs-the-rest)
}
svm_classifier = SVC(random_state=1)
model_svm2 = GridSearchCV(svm_classifier, param_grid_svm, cv=5, scoring='accuracy', n_jobs=-1)
model_svm2.fit(X_train, y_train)
best_params_svm = model_svm2.best_params_
best_svm = model_svm2.best_estimator_
print("Best Parameters:", best_params_svm)
print("Best SVM:", best_svm)
[LibSVM]Best Parameters: {'C': 1, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovo', 'degree': 2, 'gamma': 'scale', 'kernel': 'linear', 'max_iter': 100, 'probability': True, 'shrinking': True, 'tol': 0.001, 'verbose': True}
Best SVM: SVC(C=1, decision_function_shape='ovo', degree=2, kernel='linear', max_iter=100,
probability=True, random_state=1, verbose=True)
# SVM model with suggested parameters
optimised_svm = svm.SVC(
C=1,
kernel='linear',
degree=2,
gamma='scale',
coef0=0.0,
shrinking=True,
probability=True,
tol=0.001,
class_weight=None,
verbose=True,
max_iter=100,
decision_function_shape='ovo'
)
optimised_svm.fit(X_train, y_train)
y_pred_train = optimised_svm.predict(X_train)
y_pred_val = optimised_svm.predict(X_val)
[LibSVM]
y_prob_svm = optimised_svm.decision_function(X_val)
pos_label_svm = label_encoder.transform(['More than 37.42%'])[0]
y_val_encoded = label_encoder.transform(y_val)
false_positive_rate_svm, true_positive_rate_svm, thresholds_svm = roc_curve(y_val_encoded,
y_prob_svm,
pos_label=pos_label_svm)
roc_auc_svm = auc(false_positive_rate_svm, true_positive_rate_svm)
print("Optimised SVM Model Evaluation:\n")
print("Classification Report:\n")
print(classification_report(y_val, y_pred_val))
print("Confusion Matrix:\n")
print(confusion_matrix(y_val, y_pred_val))
optimised_svm_acc = accuracy_score(y_val, y_pred_val)
# optimised_svm_cm = confusion_matrix(y_test_encoded, y_pred_test)
optimised_svm_precision = precision_score(y_val, y_pred_val, pos_label='More than 37.42%', average='binary')
optimised_svm_recall = recall_score(y_val, y_pred_val, pos_label='More than 37.42%', average='binary')
optimised_svm_f1 = f1_score(y_val, y_pred_val, pos_label='More than 37.42%', average='binary')
# optimised_svm_tpr = optimised_svm_cm[1][1] / (optimised_svm_cm[1][0] + optimised_svm_cm[1][1])
print(f"\nAccuracy: {optimised_svm_acc:.4f}")
print(f"Precision: {optimised_svm_precision:.4f}")
print(f"Sensitivity (Recall): {optimised_svm_recall:.4f}")
print(f"F1 Score: {optimised_svm_f1:.4f}")
# print('Sensitivity (TPR) =', optimised_svm_tpr)
plt.figure(figsize=(10, 6))
plt.plot(false_positive_rate_svm,
true_positive_rate_svm,
color='darkorange',
lw=2,
label='ROC curve (AUC = {:.2f})'.format(roc_auc_svm))
plt.plot([0, 1], [0, 1],
color='navy',
lw=2,
linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
print(f"AUC: {roc_auc_svm:.4f}")
Optimised SVM Model Evaluation:
Classification Report:
precision recall f1-score support
Less than or equal to 37.42% 0.97 0.97 0.97 115
More than 37.42% 0.95 0.94 0.94 64
accuracy 0.96 179
macro avg 0.96 0.96 0.96 179
weighted avg 0.96 0.96 0.96 179
Confusion Matrix:
[[112 3]
[ 4 60]]
Accuracy: 0.9609
Precision: 0.9524
Sensitivity (Recall): 0.9375
F1 Score: 0.9449
AUC: 0.9800
The optimized SVM model performs strongly, achieving an accuracy of 96.09% on the validation set. Precision is notably high for both classes: 97% for 'Less than or equal to 37.42%' and 95% for 'More than 37.42%'.
The recall values are also commendable, particularly for the latter class, where recall reaches 94%. The F1 score attains 94.49%.
The confusion matrix reveals only a few misclassifications (3 false positives and 4 false negatives), further substantiating the model's efficacy. Additionally, the AUC (Area Under the Curve) score, an indicator of the model's discriminatory power, is notably high at 98%, reinforcing the SVM model's robustness.
The optimized SVM model compares favourably with the optimized decision tree and random forest models, and its accuracy of 96.09% is second only to the optimized logistic regression in the comparison table at the end of this section. Notably, the SVM model demonstrates a balanced precision of 95.24% and recall of 93.75%, keeping both false positives and false negatives low. The F1 score of 94.49% underscores the model's balance between precision and recall, outperforming the logistic regression and decision tree models on this metric and aligning with the random forest model. The SVM model also attains a notable AUC score of 98.00%, indicating robust discriminatory power.
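The headline figures above can be re-derived directly from the reported confusion matrix `[[112, 3], [4, 60]]`; a quick sketch of the arithmetic (positive class: 'More than 37.42%'):

```python
# Recompute the SVM metrics from the confusion matrix reported above
tn, fp, fn, tp = 112, 3, 4, 60

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# → 0.9609 0.9524 0.9375 0.9449
```

These match the Accuracy, Precision, Sensitivity, and F1 values printed by the evaluation cell.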
XGBoost (eXtreme Gradient Boosting) is a decision tree-based machine learning algorithm, often used in classification and regression problems. It is an extension of traditional boosting that combines predictions from multiple weak models (shallow decision trees) to improve overall model performance. The algorithm uses a technique called "gradient boosting" to sequentially adjust models by minimizing a loss function. XGBoost also incorporates regularizations to prevent overfitting, a flexible objective function, and efficient handling of missing data.
Illustrative diagram:
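The additive logic described above can also be sketched in a few lines. This is a toy illustration of gradient boosting with depth-1 "stumps" and squared-error loss (all names and data here are ours, not part of XGBoost): each round fits a stump to the current residuals (the negative gradient of the squared loss) and adds a learning-rate-shrunken copy of its prediction.

```python
import numpy as np

def fit_stump(x, residuals):
    # Exhaustively search thresholds; return (threshold, left_mean, right_mean)
    best = None
    for threshold in np.unique(x):
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= threshold, left.mean(), right.mean())
        loss = ((residuals - pred) ** 2).sum()
        if best is None or loss < best[0]:
            best = (loss, threshold, left.mean(), right.mean())
    return best[1:]

def boost(x, y, n_rounds=50, eta=0.3):
    # Additive model: each stump corrects the residuals of the ensemble so far
    pred = np.zeros_like(y, dtype=float)
    for _ in range(n_rounds):
        t, lv, rv = fit_stump(x, y - pred)
        pred += eta * np.where(x <= t, lv, rv)
    return pred

x = np.linspace(0, 1, 40)
y = (x > 0.5).astype(float)  # step-function target
pred = boost(x, y)
```

After 50 rounds the ensemble fits the step function closely, even though no single stump with shrinkage could.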
import xgboost as xgb
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
xg_train = xgb.DMatrix(X_train, label=y_train_encoded)
y_val_encoded = label_encoder.transform(y_val)
xg_validation = xgb.DMatrix(X_val)
# Fixed baseline parameters (not a search grid); note that a regression
# objective is used even though the target is a binary class label
params_xgb = {
    'eta': 0.1,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'max_depth': 3,
    'min_child_weight': 1,
    'gamma': 0,
    'subsample': 1,
    'colsample_bytree': 1,
    'reg_alpha': 0,
    'reg_lambda': 1
}
model_xgboost = xgb.train(params_xgb, xg_train, num_boost_round=15)
y_pred_train = model_xgboost.predict(xg_train)
y_pred_val = model_xgboost.predict(xg_validation)
# Map the continuous regression outputs back to string labels. Note that
# astype(int) truncates towards zero, so every prediction below 1.0 maps to
# class 0; this is why the classification metrics below collapse towards the
# majority class even though the AUC remains high
y_pred_val_labels = label_encoder.inverse_transform(y_pred_val.astype(int))
# Decode the encoded validation labels back to the original strings for evaluation
y_val_original = label_encoder.inverse_transform(y_val_encoded.astype(int))
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc
import matplotlib.pyplot as plt
print("XGBoost Model Evaluation:\n")
print("Classification Report:\n")
print(classification_report(y_val_original, y_pred_val_labels))
print("Confusion Matrix:\n")
print(confusion_matrix(y_val_original, y_pred_val_labels))
accuracy_xgb = accuracy_score(y_val_original, y_pred_val_labels)
precision_xgb = precision_score(y_val_original, y_pred_val_labels, pos_label='More than 37.42%', average='binary')
recall_xgb = recall_score(y_val_original, y_pred_val_labels, pos_label='More than 37.42%', average='binary')
f1_xgb = f1_score(y_val_original, y_pred_val_labels, pos_label='More than 37.42%', average='binary')
print(f"\nAccuracy: {accuracy_xgb:.4f}")
print(f"Precision: {precision_xgb:.4f}")
print(f"Recall: {recall_xgb:.4f}")
print(f"F1 Score: {f1_xgb:.4f}")
y_prob_xgb = model_xgboost.predict(xg_validation)
pos_label_xgb = label_encoder.transform(['More than 37.42%'])[0]
y_val_encoded = label_encoder.transform(y_val_original)
false_positive_rate_xgb, true_positive_rate_xgb, thresholds_xgb = roc_curve(y_val_encoded,
y_prob_xgb,
pos_label=pos_label_xgb)
roc_auc_xgb = auc(false_positive_rate_xgb, true_positive_rate_xgb)
plt.figure(figsize=(10, 6))
plt.plot(false_positive_rate_xgb, true_positive_rate_xgb, color='darkorange', lw=2,
label='ROC curve (AUC = {:.2f})'.format(roc_auc_xgb))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
print(f"AUC: {roc_auc_xgb:.4f}")
XGBoost Model Evaluation:
Classification Report:
precision recall f1-score support
Less than or equal to 37.42% 0.64 1.00 0.78 95
More than 37.42% 0.00 0.00 0.00 54
accuracy 0.64 149
macro avg 0.32 0.50 0.39 149
weighted avg 0.41 0.64 0.50 149
Confusion Matrix:
[[95 0]
[54 0]]
Accuracy: 0.6376
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000
AUC: 0.9568
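A likely reason for the degenerate precision and recall above, despite the high AUC, is that the regression objective produces continuous scores roughly in [0, 1], and `astype(int)` truncates towards zero instead of rounding, so nearly every prediction maps to class 0. A small sketch of the difference (the score values are made up for illustration):

```python
import numpy as np

scores = np.array([0.10, 0.49, 0.51, 0.90, 1.20])

truncated = scores.astype(int)              # floors towards zero
thresholded = (scores >= 0.5).astype(int)   # usual 0.5 decision cut-off

print(truncated)    # → [0 0 0 0 1]
print(thresholded)  # → [0 0 1 1 1]
```

The ranking of the scores is still good (hence AUC ≈ 0.96), but the label assignment is not.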
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
xg_train = xgb.DMatrix(X_train.values, label=y_train_encoded, feature_names=list(X_train.columns))
param_grid_xgb_regression = {
'eta': [0.1, 0.2, 0.3], # Learning rate
'objective': ['reg:squarederror'], # Regression objective function
'eval_metric': ['rmse'], # Evaluation metric for regression
'max_depth': [3, 5, 7], # Maximum depth of a tree
'min_child_weight': [1, 3, 5], # Minimum sum of instance weight (hessian) needed in a child
'gamma': [0, 0.1, 0.2], # Minimum loss reduction required to make a further partition on a leaf node
'subsample': [0.8, 1.0], # Subsample ratio of the training instances
'colsample_bytree': [0.8, 1.0], # Subsample ratio of columns when constructing each tree
'reg_alpha': [0, 0.1, 0.5], # L1 regularization term on weights
'reg_lambda': [1, 1.5, 2] # L2 regularization term on weights
}
# Base estimator; 'eta' in the grid above is an alias for learning_rate,
# so the grid-searched values override the defaults set here
model_xgboost2 = xgb.XGBRegressor(
    learning_rate=0.3,
    objective='reg:squarederror',
    n_estimators=40,
    verbosity=0
)
grid_search_xgb_regression = GridSearchCV(model_xgboost2,
param_grid_xgb_regression,
cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1)
grid_search_xgb_regression.fit(X_train, y_train_encoded)
best_params_xgb_regression = grid_search_xgb_regression.best_params_
best_xgb_regression = grid_search_xgb_regression.best_estimator_
print("Best Parameters:", best_params_xgb_regression)
print("Best XGBoost Regressor:", best_xgb_regression)
y_pred_train_xgb_regression = best_xgb_regression.predict(X_train)
y_pred_val_xgb_regression = best_xgb_regression.predict(X_val)
mse_train = mean_squared_error(y_train_encoded, y_pred_train_xgb_regression)
mse_val = mean_squared_error(y_val_encoded, y_pred_val_xgb_regression)
print("Mean Squared Error on Training Set:", mse_train)
print("Mean Squared Error on Validation Set:", mse_val)
Best Parameters: {'colsample_bytree': 1.0, 'eta': 0.1, 'eval_metric': 'rmse', 'gamma': 0, 'max_depth': 7, 'min_child_weight': 1, 'objective': 'reg:squarederror', 'reg_alpha': 0, 'reg_lambda': 1, 'subsample': 1.0}
Best XGBoost Regressor: XGBRegressor(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=1.0, early_stopping_rounds=None,
enable_categorical=False, eta=0.1, eval_metric='rmse',
feature_types=None, gamma=0, gpu_id=None, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.3, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=7,
max_leaves=None, min_child_weight=1, missing=nan,
monotone_constraints=None, n_estimators=40, n_jobs=None,
num_parallel_tree=None, predictor=None, ...)
Mean Squared Error on Training Set: 0.002336657212192746
Mean Squared Error on Validation Set: 0.010224422121232682
import xgboost as xgb
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
xg_train = xgb.DMatrix(X_train, label=y_train_encoded)
y_val_encoded = label_encoder.transform(y_val)
xg_validation = xgb.DMatrix(X_val)
# Fixed parameters for the "optimised" run (not a search grid); note that
# several values (eta, max_depth, min_child_weight, gamma, reg_alpha,
# reg_lambda) differ from the best parameters reported by the grid search above
params_optimised_xgb = {
    'eta': 1,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'max_depth': 15,
    'min_child_weight': 5,
    'gamma': 0.5,
    'subsample': 1.0,
    'colsample_bytree': 1.0,
    'reg_alpha': 0.5,
    'reg_lambda': 2
}
optimised_xgb = xgb.train(params_optimised_xgb, xg_train, num_boost_round=15)
y_pred_train = optimised_xgb.predict(xg_train)
y_pred_val = optimised_xgb.predict(xg_validation)
# Convert continuous predictions back to string labels; astype(int) truncates,
# so only scores of 1.0 or above map to the positive class, which depresses
# the recall reported below
y_pred_val_labels = label_encoder.inverse_transform(y_pred_val.astype(int))
# Decode numerical labels back to original values for evaluation
y_val_original = label_encoder.inverse_transform(y_val_encoded.astype(int))
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
y_prob_val = optimised_xgb.predict(xg_validation)
y_pred_val_labels = label_encoder.inverse_transform(y_prob_val.astype(int))
y_val_original = label_encoder.inverse_transform(y_val_encoded.astype(int))
pos_label_xgb = label_encoder.transform(['More than 37.42%'])[0]
false_positive_rate_xgb, true_positive_rate_xgb, thresholds_xgb = roc_curve(y_val_encoded, y_prob_val, pos_label=pos_label_xgb)
roc_auc_xgb = auc(false_positive_rate_xgb, true_positive_rate_xgb)
print("Optimised XGBoost Model Evaluation:\n")
print("Classification Report:\n")
print(classification_report(y_val_original, y_pred_val_labels))
print("Confusion Matrix:\n")
print(confusion_matrix(y_val_original, y_pred_val_labels))
accuracy_xgb = accuracy_score(y_val_original, y_pred_val_labels)
precision_xgb = precision_score(y_val_original, y_pred_val_labels, pos_label='More than 37.42%', average='binary')
recall_xgb = recall_score(y_val_original, y_pred_val_labels, pos_label='More than 37.42%', average='binary')
f1_xgb = f1_score(y_val_original, y_pred_val_labels, pos_label='More than 37.42%', average='binary')
print(f"\nAccuracy: {accuracy_xgb:.4f}")
print(f"Precision: {precision_xgb:.4f}")
print(f"Recall: {recall_xgb:.4f}")
print(f"F1 Score: {f1_xgb:.4f}")
plt.figure(figsize=(10, 6))
plt.plot(false_positive_rate_xgb, true_positive_rate_xgb, color='darkorange', lw=2,
label=f'ROC curve (AUC = {roc_auc_xgb:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
print(f"AUC: {roc_auc_xgb:.4f}")
Optimised XGBoost Model Evaluation:
Classification Report:
precision recall f1-score support
Less than or equal to 37.42% 0.76 1.00 0.86 95
More than 37.42% 1.00 0.44 0.62 54
accuracy 0.80 149
macro avg 0.88 0.72 0.74 149
weighted avg 0.85 0.80 0.77 149
Confusion Matrix:
[[95 0]
[30 24]]
Accuracy: 0.7987
Precision: 1.0000
Recall: 0.4444
F1 Score: 0.6154
AUC: 0.9602
Neural networks are machine learning models inspired by the functioning of the human brain. They are composed of layers of interconnected neurons that learn from data by adjusting the weights of their connections. Neural networks generally contain three types of layers: an input layer, hidden layers, and an output layer. They can be used for classification, regression, or more complex problems such as computer vision and natural language processing. Learning is typically done by backpropagation of gradients, where the network adjusts its weights based on the error between its predictions and the true target values.
Illustrative diagram:
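The weight-update loop described above can be sketched with a single hidden layer and manual backpropagation. This is a toy example with made-up data (fitting y = x² on [-1, 1]), not the Keras model used below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = x^2 on [-1, 1]
X = np.linspace(-1, 1, 50).reshape(-1, 1)
y = X ** 2

# One hidden layer (tanh activation), linear output
W1 = rng.normal(0, 0.5, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)   # hidden activations
    return h, h @ W2 + b2      # network output

_, out0 = forward(X)
initial_loss = ((out0 - y) ** 2).mean()

lr = 0.1
for _ in range(2000):
    h, out = forward(X)
    # Backpropagation: gradients of the mean-squared error, layer by layer
    d_out = 2 * (out - y) / len(X)
    dW2 = h.T @ d_out; db2 = d_out.sum(0)
    d_h = (d_out @ W2.T) * (1 - h ** 2)   # tanh derivative
    dW1 = X.T @ d_h; db1 = d_h.sum(0)
    # Gradient-descent weight updates
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

_, out = forward(X)
final_loss = ((out - y) ** 2).mean()
```

Each iteration pushes the weights against the error gradient, so the loss shrinks as training proceeds; Keras automates exactly this loop.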
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
label_encoder = LabelEncoder()
dummy_y_train = to_categorical(label_encoder.fit_transform(y_train))
dummy_y_val = to_categorical(label_encoder.transform(y_val))
model_nn = Sequential()
model_nn.add(Dense(32, input_dim=X_train.shape[1], activation='relu'))
model_nn.add(Dropout(0.2)) # Adjusted dropout rate
model_nn.add(Dense(32, activation='relu'))
model_nn.add(Dropout(0.2)) # Adjusted dropout rate
model_nn.add(Dense(len(label_encoder.classes_), activation='softmax')) # Output layer with the number of classes
# MSE on the one-hot targets; categorical_crossentropy would be the more conventional loss for a softmax output
model_nn.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])
model_nn.fit(X_train, dummy_y_train, epochs=20, verbose=0)
y_prob_val = model_nn.predict(X_val)
y_pred_val = np.argmax(y_prob_val, axis=1)
6/6 [==============================] - 0s 3ms/step
X_train, X_val, y_train, y_val = train_test_split(predictors, target, test_size=0.2, random_state=1)
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
dummy_y_train = to_categorical(y_train_encoded)
dummy_y_val = to_categorical(label_encoder.transform(y_val))
y_prob_val = model_nn.predict(X_val)
y_pred_val = np.argmax(y_prob_val, axis=1)
mse_val = mean_squared_error(label_encoder.transform(y_val), y_pred_val)
print("Mean Squared Error on Validation Set:", mse_val)
print("\n Neural Networks Model Evaluation:")
print("\nAccuracy on the validation set:", metrics.accuracy_score(label_encoder.transform(y_val), y_pred_val))
print("\nClassification Report:\n")
print(metrics.classification_report(label_encoder.transform(y_val), y_pred_val))
print("Confusion Matrix:\n")
print(metrics.confusion_matrix(label_encoder.transform(y_val), y_pred_val))
roc_auc_values = []
plt.figure(figsize=(8, 5))
for i in range(len(label_encoder.classes_)):
y_true_class_i = (label_encoder.transform(y_val) == i).astype(int)
y_prob_class_i = y_prob_val[:, i]
false_positive_rate, true_positive_rate, _ = roc_curve(y_true_class_i, y_prob_class_i)
roc_auc = roc_auc_score(y_true_class_i, y_prob_class_i)
roc_auc_values.append(roc_auc)
plt.plot(false_positive_rate, true_positive_rate, lw=2, label=f'Class {label_encoder.classes_[i]} (AUC = {roc_auc:.2f})')
# Note: the "average" curve plotted below is the class-1 ROC curve, labelled with the mean AUC across classes
false_positive_rate_avg, true_positive_rate_avg, _ = roc_curve(label_encoder.transform(y_val), y_prob_val[:, 1])
roc_auc_avg = np.mean(roc_auc_values)
plt.plot(false_positive_rate_avg, true_positive_rate_avg, color='darkorange', linestyle='--', lw=2,
label=f'Average ROC Curve (AUC = {roc_auc_avg:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curves')
plt.legend(loc="lower right")
plt.show()
print(f"Average AUC across all classes: {roc_auc_avg:.4f}")
5/5 [==============================] - 0s 3ms/step
Mean Squared Error on Validation Set: 0.06711409395973154
Neural Networks Model Evaluation:
Accuracy on the validation set: 0.9328859060402684
Classification Report:
precision recall f1-score support
0 0.92 0.98 0.95 95
1 0.96 0.85 0.90 54
accuracy 0.93 149
macro avg 0.94 0.92 0.93 149
weighted avg 0.93 0.93 0.93 149
Confusion Matrix:
[[93 2]
[ 8 46]]
Average AUC across all classes: 0.9702
label_encoder = LabelEncoder()
dummy_y_train = to_categorical(label_encoder.fit_transform(y_train))
dummy_y_val = to_categorical(label_encoder.transform(y_val))
from keras_tuner import GridSearch
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
def build_model(hp):
model_nn2 = Sequential()
model_nn2.add(Dense(units=hp.Int('units_1', min_value=16, max_value=128, step=16),
input_dim=X_train.shape[1], activation=hp.Choice('activation_1', values=['relu', 'tanh'])))
model_nn2.add(Dropout(rate=hp.Float('dropout_1', min_value=0.2, max_value=0.5, step=0.1)))
model_nn2.add(Dense(units=hp.Int('units_2', min_value=16, max_value=128, step=16),
activation=hp.Choice('activation_2', values=['relu', 'tanh'])))
model_nn2.add(Dropout(rate=hp.Float('dropout_2', min_value=0.2, max_value=0.5, step=0.1)))
model_nn2.add(Dense(len(label_encoder.classes_), activation='softmax'))
model_nn2.compile(optimizer=Adam(learning_rate=hp.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='log')),
loss='mean_squared_error', metrics=['mae'])
return model_nn2
tuning1 = GridSearch(
build_model,
objective='val_loss',
hyperparameters={
'units_1': [8, 16, 32, 64, 128],
'activation_1': ['relu', 'tanh'],
'dropout_1': [0.2, 0.3, 0.4, 0.5],
'units_2': [8, 16, 32, 64, 128],
'activation_2': ['relu', 'tanh'],
'dropout_2': [0.2, 0.3, 0.4, 0.5],
'learning_rate': [1e-4, 1e-3, 1e-2],
},
directory='my_tuning_directory',
project_name='neural_network_tuning'
)
early_stopping = EarlyStopping(monitor='val_loss', patience=5)
tuning1.search(X_train, dummy_y_train, epochs=20, validation_data=(X_val, dummy_y_val), callbacks=[early_stopping])
best_hyperparameters = tuning1.oracle.get_best_trials(num_trials=1)[0].hyperparameters
Reloading Tuner from my_tuning_directory\neural_network_tuning\tuner0.json
optimised_nn = build_model(best_hyperparameters)
# Note: recompiling with the default Adam optimizer overrides the tuned learning rate from build_model
optimised_nn.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])
optimised_nn.fit(X_train, dummy_y_train, epochs=20, verbose=0)
y_prob_val = optimised_nn.predict(X_val)
y_pred_val = np.argmax(y_prob_val, axis=1)
5/5 [==============================] - 0s 4ms/step
print(optimised_nn.summary())
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_3 (Dense) (None, 96) 20256
dropout_2 (Dropout) (None, 96) 0
dense_4 (Dense) (None, 112) 10864
dropout_3 (Dropout) (None, 112) 0
dense_5 (Dense) (None, 2) 226
=================================================================
Total params: 31346 (122.45 KB)
Trainable params: 31346 (122.45 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
optimised_nn.evaluate(X_val, dummy_y_val)
5/5 [==============================] - 0s 5ms/step - loss: 0.0074 - mae: 0.0227
[0.007447345182299614, 0.02274118922650814]
X_train, X_val, y_train, y_val = train_test_split(predictors, target, test_size=0.2, random_state=1)
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
dummy_y_train = to_categorical(y_train_encoded)
dummy_y_val = to_categorical(label_encoder.transform(y_val))
y_prob_val = model_nn.predict(X_val)
y_pred_val = np.argmax(y_prob_val, axis=1)
# mse_val = mean_squared_error(label_encoder.transform(y_val), y_pred_val)
print("\nOptimised Neural Networks Model Evaluation:")
print("\nAccuracy on the validation set:", metrics.accuracy_score(label_encoder.transform(y_val), y_pred_val))
print("\nClassification Report:\n")
print(metrics.classification_report(label_encoder.transform(y_val), y_pred_val))
print("Confusion Matrix:\n")
print(metrics.confusion_matrix(label_encoder.transform(y_val), y_pred_val))
roc_auc_values = []
plt.figure(figsize=(8, 5))
for i in range(len(label_encoder.classes_)):
y_true_class_i = (label_encoder.transform(y_val) == i).astype(int)
y_prob_class_i = y_prob_val[:, i]
false_positive_rate, true_positive_rate, _ = roc_curve(y_true_class_i, y_prob_class_i)
roc_auc = roc_auc_score(y_true_class_i, y_prob_class_i)
roc_auc_values.append(roc_auc)
plt.plot(false_positive_rate, true_positive_rate, lw=2, label=f'Class {label_encoder.classes_[i]} (AUC = {roc_auc:.2f})')
# NOTE: y_val_original and y_pred_val_labels below are stale variables left
# over from the XGBoost cells, which is why the metrics printed here disagree
# with the classification report above; the evaluation is redone correctly in
# the next cell
optimised_nn_acc = accuracy_score(y_val_original, y_pred_val_labels)
# optimised_nn_cm = confusion_matrix(y_test_encoded, y_pred_test)
optimised_nn_precision = precision_score(y_val_original, y_pred_val_labels, pos_label='More than 37.42%', average='binary')
optimised_nn_recall = recall_score(y_val_original, y_pred_val_labels, pos_label='More than 37.42%', average='binary')
optimised_nn_f1 = f1_score(y_val_original, y_pred_val_labels, pos_label='More than 37.42%', average='binary')
# optimised_nn_tpr = optimised_nn_cm[1][1] / (optimised_nn_cm[1][0] + optimised_nn_cm[1][1])
print(f"\nAccuracy: {optimised_nn_acc:.4f}")
print(f"Precision: {optimised_nn_precision:.4f}")
print(f"Sensitivity (Recall): {optimised_nn_recall:.4f}")
print(f"F1 Score: {optimised_nn_f1:.4f}")
# print("Sensitivity (TPR) =", optimised_nn_tpr)
false_positive_rate_avg, true_positive_rate_avg, _ = roc_curve(label_encoder.transform(y_val), y_prob_val[:, 1])
roc_auc_avg = np.mean(roc_auc_values)
plt.plot(false_positive_rate_avg, true_positive_rate_avg, color='darkorange', linestyle='--', lw=2,
label=f'Average ROC Curve (AUC = {roc_auc_avg:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curves')
plt.legend(loc="lower right")
plt.show()
print(f"Average AUC across all classes: {roc_auc_avg:.4f}")
5/5 [==============================] - 0s 3ms/step
Optimised Neural Networks Model Evaluation:
Accuracy on the validation set: 0.9328859060402684
Classification Report:
precision recall f1-score support
0 0.92 0.98 0.95 95
1 0.96 0.85 0.90 54
accuracy 0.93 149
macro avg 0.94 0.92 0.93 149
weighted avg 0.93 0.93 0.93 149
Confusion Matrix:
[[93 2]
[ 8 46]]
Accuracy: 0.6425
Precision: 0.0000
Sensitivity (Recall): 0.0000
F1 Score: 0.0000
Average AUC across all classes: 0.9702
X_train, X_val, y_train, y_val = train_test_split(predictors, target, test_size=0.2, random_state=1)
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
dummy_y_train = to_categorical(y_train_encoded)
dummy_y_val = to_categorical(label_encoder.transform(y_val))
y_prob_val = model_nn.predict(X_val)
y_pred_val = np.argmax(y_prob_val, axis=1)
optimised_nn_precision = metrics.precision_score(label_encoder.transform(y_val), y_pred_val, average='weighted')
optimised_nn_recall = metrics.recall_score(label_encoder.transform(y_val), y_pred_val, average='weighted')
optimised_nn_f1 = metrics.f1_score(label_encoder.transform(y_val), y_pred_val, average='weighted')
print("\nOptimised Neural Networks Model Evaluation:")
print("\nAccuracy on the validation set:", metrics.accuracy_score(label_encoder.transform(y_val), y_pred_val))
print("\nClassification Report:\n")
print(metrics.classification_report(label_encoder.transform(y_val), y_pred_val))
print("Confusion Matrix:\n")
print(metrics.confusion_matrix(label_encoder.transform(y_val), y_pred_val))
roc_auc_values = []
plt.figure(figsize=(8, 5))
for i in range(len(label_encoder.classes_)):
y_true_class_i = (label_encoder.transform(y_val) == i).astype(int)
y_prob_class_i = y_prob_val[:, i]
false_positive_rate, true_positive_rate, _ = metrics.roc_curve(y_true_class_i, y_prob_class_i)
roc_auc = metrics.roc_auc_score(y_true_class_i, y_prob_class_i)
roc_auc_values.append(roc_auc)
plt.plot(false_positive_rate, true_positive_rate, lw=2, label=f'Class {label_encoder.classes_[i]} (AUC = {roc_auc:.2f})')
optimised_nn_acc = metrics.accuracy_score(label_encoder.transform(y_val), y_pred_val)
roc_auc_avg = np.mean(roc_auc_values)
print(f"\nAccuracy: {optimised_nn_acc:.4f}")
print(f"Precision: {optimised_nn_precision:.4f}")
print(f"Sensitivity (Recall): {optimised_nn_recall:.4f}")
print(f"F1 Score: {optimised_nn_f1:.4f}")
print(f"Average AUC across all classes: {roc_auc_avg:.4f}")
false_positive_rate_avg, true_positive_rate_avg, _ = metrics.roc_curve(label_encoder.transform(y_val), y_prob_val[:, 1])
plt.plot(false_positive_rate_avg, true_positive_rate_avg, color='darkorange', linestyle='--', lw=2,
label=f'Average ROC Curve (AUC = {roc_auc_avg:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curves')
plt.legend(loc="lower right")
plt.show()
5/5 [==============================] - 0s 2ms/step
Optimised Neural Networks Model Evaluation:
Accuracy on the validation set: 0.9328859060402684
Classification Report:
precision recall f1-score support
0 0.92 0.98 0.95 95
1 0.96 0.85 0.90 54
accuracy 0.93 149
macro avg 0.94 0.92 0.93 149
weighted avg 0.93 0.93 0.93 149
Confusion Matrix:
[[93 2]
[ 8 46]]
Accuracy: 0.9329
Precision: 0.9344
Sensitivity (Recall): 0.9329
F1 Score: 0.9319
Average AUC across all classes: 0.9702
# Collecting results
tpr_weight = 0.25
models = pd.DataFrame({
'Model': ['Logistic Regression', 'Decision Trees', 'Random Forest',
'SVM', 'XGBoost', 'Neural Networks'],
'Accuracy': [optimised_logreg_acc, optimised_dtree_acc, optimised_rf_acc,
optimised_svm_acc, optimised_xgb_acc, optimised_nn_acc],
'Sensitivity' : [optimised_logreg_recall, optimised_dtree_recall, optimised_rf_recall,
optimised_svm_recall, optimised_xgb_recall, optimised_nn_recall],
'Objective Value' : [optimised_logreg_acc + optimised_logreg_recall * tpr_weight,
optimised_dtree_acc + optimised_dtree_recall * tpr_weight,
optimised_rf_acc + optimised_rf_recall * tpr_weight,
optimised_svm_acc + optimised_svm_recall * tpr_weight,
optimised_xgb_acc + optimised_xgb_recall * tpr_weight,
optimised_nn_acc + optimised_nn_recall * tpr_weight] })
models
|   | Model | Accuracy | Sensitivity | Objective Value |
|---|---|---|---|---|
| 0 | Logistic Regression | 0.993289 | 0.981132 | 1.238572 |
| 1 | Decision Trees | 0.879195 | 0.735849 | 1.063157 |
| 2 | Random Forest | 0.949721 | 0.937500 | 1.184096 |
| 3 | SVM | 0.960894 | 0.937500 | 1.195269 |
| 4 | XGBoost | 0.798658 | 0.981481 | 1.044028 |
| 5 | Neural Networks | 0.932886 | 0.932886 | 1.166107 |
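The 'Objective Value' column is simply accuracy plus 0.25 times sensitivity; for instance, the SVM row can be checked by hand:

```python
tpr_weight = 0.25  # same weight as in the cell above

# SVM accuracy and sensitivity copied from the table
svm_acc, svm_recall = 0.960894, 0.937500
objective = svm_acc + tpr_weight * svm_recall

print(round(objective, 6))  # → 1.195269
```

This reproduces the SVM objective value in the table, confirming how the column is built.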
# Collection results for plotting
df_plotting = pd.DataFrame({
'Model': ['Logistic Regression', 'Decision Trees',
'Random Forest', 'SVM',
'XGBoost', 'Neural Networks',
'Logistic Regression', 'Decision Trees',
'Random Forest', 'SVM',
'XGBoost', 'Neural Networks'],
'Values': [optimised_logreg_acc, optimised_dtree_acc,
optimised_rf_acc, optimised_svm_acc,
optimised_xgb_acc, optimised_nn_acc,
optimised_logreg_recall, optimised_dtree_recall,
optimised_rf_recall, optimised_svm_recall,
optimised_xgb_recall, optimised_nn_recall],
'Type': ['Accuracy', 'Accuracy',
'Accuracy','Accuracy',
'Accuracy','Accuracy',
'Sensitivity','Sensitivity',
'Sensitivity', 'Sensitivity',
'Sensitivity', 'Sensitivity']})
# Plotting the results comparison
plt.figure(figsize=(15,10))
sns.barplot(y=df_plotting['Values'],
x=df_plotting['Type'],
hue=df_plotting['Model'],
orient ="v")
plt.ylim(min(df_plotting['Values']) * 0.9,
max(df_plotting['Values']) * 1.01)
plt.show()
# Plotting the results comparison
plt.figure(figsize=(20,5))
sns.barplot(y=models['Objective Value'],
x=models['Model'],
orient ="v")
plt.ylim(min(models['Objective Value']) * 0.90,
max(models['Objective Value'])*1.01)
plt.show()
The choice of the model depends on the specific priorities and requirements of your study. Here are considerations for each model based on the provided evaluations:
Optimized Neural Network:
Optimized Logistic Regression:
Optimized Decision Trees:
Optimized Random Forest:
Optimized SVM:
Optimized XGBoost:
Final choice: Considering the trade-off between performance and interpretability, our choice leans towards the Optimized Neural Network or the Optimized Logistic Regression. If computational efficiency is a priority and interpretability is crucial, logistic regression might be preferred. However, when the priority is maximizing predictive performance, the neural network could be a strong contender. It is advisable to further validate and fine-tune the selected model on additional datasets or through cross-validation to ensure robust generalization.
Integrating ChatGPT into our machine learning project could be done in several phases of the process. First of all, the model we use is GPT-3.5.
Regarding the creation of a prototype, we use a user interface where ChatGPT guides users through the interpretation of the model results or provides additional information about the data. ChatGPT integration would add value by facilitating communication between technical and non-technical experts, improving model transparency and overall project understanding.
To do this:
#!pip install openai
#from openai import OpenAI
!pip install openai
import openai
openai.api_key = 'YOUR_API_KEY'  # never hard-code a real key; load it from an environment variable instead
# Function to interact with ChatGPT
def chat_with_gpt(prompt):
response = openai.Completion.create(
engine="text-davinci-003",
prompt=prompt,
max_tokens=150,
temperature=0.7
)
return response.choices[0].text.strip()
# Example usage
user_input = input("Question: ")
while user_input.lower() != 'exit':
prompt = f"You: {user_input}\nChatGPT:"
response = chat_with_gpt(prompt)
print(f"ChatGPT: {response}\n")
user_input = input("Question: ")
Question: Act as a Data Scientist and interpret an AUC of 0.80
---------------------------------------------------------------------------
APIRemovedInV1                            Traceback (most recent call last)
Cell In[31], line 22
---> 22 response = chat_with_gpt(prompt)
Cell In[31], line 8, in chat_with_gpt(prompt)
----> 8 response = openai.Completion.create(
File ~\AppData\Roaming\Python\Python311\site-packages\openai\lib\_old_api.py:39, in APIRemovedInV1Proxy.__call__(self, *_args, **_kwargs)
---> 39 raise APIRemovedInV1(symbol=self._symbol)
APIRemovedInV1: You tried to access openai.Completion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API. You can run `openai migrate` to automatically upgrade your codebase to use the 1.0.0 interface. Alternatively, you can pin your installation to the old version, e.g. `pip install openai==0.28`. A detailed migration guide is available here: https://github.com/openai/openai-python/discussions/742
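The traceback above comes from running the legacy `openai.Completion` call under openai>=1.0.0, where that interface was removed. A sketch of the equivalent call with the v1 client interface (the model name and message structure are our assumptions; the client reads the API key from the `OPENAI_API_KEY` environment variable, so no key appears in code):

```python
def build_prompt(user_input):
    # Same prompt framing as in the interaction loop above
    return f"You: {user_input}\nChatGPT:"

def chat_with_gpt(prompt):
    # openai>=1.0.0 replaces module-level openai.Completion with a client object
    from openai import OpenAI
    client = OpenAI()  # picks up OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150,
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
```

With this client-based call in place of `openai.Completion.create`, the interaction loop above runs unchanged.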
An important ethical issue is the possibility of manipulating users for malicious purposes. Conversational AIs could be used to spread false information, incite hatred, or exploit individuals' psychological vulnerabilities. The design and deployment of ChatGPT therefore requires careful consideration of how to prevent these risks and promote a healthy and caring online environment.
Additionally, ChatGPT is capable of generating coherent and relevant text to answer a variety of queries. This gives it considerable power of influence over users. When people regularly interact with a conversational AI system, they may develop trust in it, perceiving it as a reliable source of knowledge. This raises the question of ChatGPT’s responsibility for spreading erroneous or biased information.
ChatGPT proved useful in this project because, among other things, it allowed us to optimize our code. ChatGPT is helpful in programming: it can explain code, even code that has been intentionally obfuscated, and it can be used to analyze human-written code for security flaws.
Challenges encountered with the ChatGPT tool include managing bias in its generated responses. Addressing this involves constant monitoring of the responses ChatGPT generates, adjusting model parameters, and applying pre- and post-processing techniques to ensure the ethics and security of the project.
There is also the fact that users often share personal and sensitive information. Protecting the confidentiality of this data is crucial to avoid potential abuse. Companies deploying conversational AI must put robust measures in place to ensure user data is stored and used securely and responsibly.